Navigating User Behavior toward Personalized Multimodal Generation
Quick Answer
NaviGen enhances personalized multimodal content generation by transforming user interaction history into executable instructions, addressing the challenges of behavior encoding and instruction writing.
Quick Take
Across product, game, and short-video domains, NaviGen improves personalized image and video generation and produces instructions that are more specific, relevant, and visually generatable.
Key Points
- NaviGen represents each item with a dual identifier that couples a collaborative (behavioral) code with a textual (semantic) code.
- A two-stage SFT+RL pipeline first distills preference reasoning and instruction writing, then aligns generation with user intent.
- Next-item prediction improves across product, game, and short-video domains.
- Generated instructions become more specific, relevant, and visually generatable.
- The code is released; the link is given on the arXiv page.
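The dual-identifier idea from the key points can be pictured with a minimal sketch. The source only says that each item couples a collaborative code and a textual code in one token stream. The token format (`<c0_3>`, `<sep>`), the class name `DualID`, and the helper `serialize_history` are hypothetical illustrations, not the authors' implementation.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class DualID:
    """Hypothetical dual identifier for one item (not the authors' format)."""
    collab_code: Tuple[int, ...]  # assumed: discrete codes derived from behavior data
    text_code: Tuple[str, ...]    # assumed: short textual descriptors of the item

    def to_tokens(self) -> List[str]:
        # Collaborative codes become special tokens; textual codes stay as plain
        # words, so both parts share a single token stream.
        collab = [f"<c{level}_{code}>" for level, code in enumerate(self.collab_code)]
        return ["<item>"] + collab + ["<sep>"] + list(self.text_code) + ["</item>"]


def serialize_history(history: List[DualID]) -> str:
    """Flatten a user's interaction history into one prompt string."""
    tokens: List[str] = []
    for item in history:
        tokens.extend(item.to_tokens())
    return " ".join(tokens)


# Made-up example history, for illustration only.
history = [
    DualID(collab_code=(3, 17), text_code=("red", "running", "shoes")),
    DualID(collab_code=(3, 42), text_code=("trail", "backpack")),
]
prompt = serialize_history(history)
print(prompt)
```

In this sketch the shared prefix token `<c0_3>` stands in for behavioral similarity between the two items, while the words after `<sep>` give a language model something it can read directly.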
Paper Resources
Abstract: Modern AIGC pipelines deliver high-fidelity images and videos but presuppose a well-formed creation instruction, while end users rarely articulate visual details, leaving generators misaligned with user demand. We study personalized content generation, which turns a user's interaction history into an executable instruction for downstream synthesis, and identify two obstacles: behavior must be encoded in a form legible to language reasoning, and the model must acquire instruction-writing skill absent from both pretraining and behavior data. We propose NaviGen, which represents each item with a dual identifier coupling a collaborative code and a textual code as a behavioral substrate and a semantic bridge in one token stream. On this representation, a two-stage SFT+RL pipeline first distills preference reasoning and instruction writing from evolutionarily searched supervision, then aligns generation with user intent through hierarchical and self-consistent rewards. Experiments across product, game, and short-video domains show that NaviGen improves personalized image and video generation, strengthens next-item prediction, and yields more specific, relevant, and visually generatable instructions. Our code is anonymously released at: this https URL.
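The abstract names "hierarchical and self-consistent rewards" for the RL stage without giving formulas. The sketch below shows one possible reading only: a gated reward in which finer-grained credit counts only after coarser checks pass, and a consistency term measuring agreement across sampled rollouts. The function names, the gating order, and the weights 0.2, 0.3, and 0.5 are assumptions made for illustration and do not come from the paper.

```python
from typing import Sequence


def hierarchical_reward(format_ok: bool, item_hit: bool, instruction_score: float) -> float:
    """Gated reward: a later level counts only if earlier levels pass (assumed design)."""
    if not format_ok:
        return 0.0
    reward = 0.2  # assumed credit for a well-formed output
    if not item_hit:
        return reward
    reward += 0.3  # assumed credit for predicting the correct next item
    clipped = min(max(instruction_score, 0.0), 1.0)  # keep the score in [0, 1]
    return reward + 0.5 * clipped  # assumed weight on instruction quality


def self_consistency_reward(sampled_items: Sequence[str], chosen_item: str) -> float:
    """Fraction of sampled rollouts that agree with the chosen prediction."""
    if not sampled_items:
        return 0.0
    agree = sum(1 for s in sampled_items if s == chosen_item)
    return agree / len(sampled_items)
```

A training loop could combine the two terms, for example by summing them, but how the paper actually combines them is not stated in this summary.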
| Comments: | 16 pages, 15 figures, 5 tables. Code is available at this https URL |
| Subjects: | Artificial Intelligence (cs.AI) |
| Cite as: | arXiv:2606.24196 [cs.AI] (or arXiv:2606.24196v1 [cs.AI] for this version) |
| DOI: | https://doi.org/10.48550/arXiv.2606.24196 (arXiv-issued DOI via DataCite, pending registration) |
Submission history
From: Hengji Zhou
[v1] Tue, 23 Jun 2026 06:31:21 UTC (3,769 KB)
— Originally published at arxiv.org