COMPASS: Grounding Composition-Intent Guidance in Unified Multimodal Models
Quick Answer
COMPASS introduces a unified multimodal framework for composition-intent control, enhancing both perception and generation through a shared expert token.
Quick Take
Beyond the shared expert token, COMPASS reports substantially better category-level composition understanding and more composition-consistent generation than strong baselines. To support training and evaluation, the authors also construct Comp-11, a large-scale dataset with an 11-class taxonomy and reasoning-augmented annotations.
Key Points
- COMPASS integrates composition perception and generation in a single system.
- Utilizes a shared expert token, τ_c, for controlling composition intent.
- Introduces Comp-11, a large dataset with 11 classes for composition learning.
- Demonstrates substantial improvements in category-level composition understanding.
- Achieves more consistent and prompt-faithful generation compared to existing models.
Abstract: Composition is a high-level visual intent that governs where subjects are placed and how a scene is organized, yet current unified multimodal models remain unreliable at fine-grained composition recognition and struggle to turn such intent into controllable generation. We present COMPASS, the first unified multimodal framework that grounds composition-intent control in a single system spanning both composition perception and composition-guided generation, with a shared expert token $\tau_c$ as the central intent anchor. On the perception side, COMPASS injects composition expertise into an MoE backbone in a minimally invasive manner and distills the inferred intent into $\tau_c$. On the generation side, COMPASS reuses $\tau_c$ as a global conditioning signal that steers the denoising trajectory, effectively converting passive composition analysis into explicit layout control. To support systematic instruction-following composition learning and evaluation at scale, we construct Comp-11, a large-scale dataset with an 11-class taxonomy and reasoning-augmented annotations. Extensive experiments show that COMPASS substantially improves category-level composition understanding and delivers more composition-consistent, prompt-faithful generation than strong baselines.
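The idea of one token serving both perception and generation can be illustrated with a toy sketch. This is not the authors' implementation: the abstract gives no architectural details, so every name here (`infer_tau_c`, `denoise_step`, `DIM`, the random class embeddings, the linear update rule) is a hypothetical stand-in. The only detail taken from the source is the 11-class taxonomy. The sketch assumes that the perception side yields class logits, that $\tau_c$ is a probability-weighted mix of class embeddings, and that the generation side pulls a latent toward a target set by $\tau_c$.

```python
import math
import random

NUM_CLASSES = 11  # from the source: Comp-11 has an 11-class taxonomy
DIM = 4           # hypothetical embedding width, chosen only for illustration


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def make_class_embeddings(seed=0):
    """Hypothetical per-class embeddings (random here, learned in a real model)."""
    rng = random.Random(seed)
    return [[rng.uniform(-1.0, 1.0) for _ in range(DIM)] for _ in range(NUM_CLASSES)]


def infer_tau_c(class_logits, class_embeddings):
    """Toy perception side: distill class beliefs into a single vector tau_c."""
    probs = softmax(class_logits)
    return [
        sum(p * emb[d] for p, emb in zip(probs, class_embeddings))
        for d in range(DIM)
    ]


def denoise_step(x_t, tau_c, guidance=0.5, step=0.1):
    """Toy generation side: one update that moves the latent toward guidance * tau_c.

    The 'predicted noise' is simply the offset from the tau_c-defined target;
    a real denoiser would be a learned network conditioned on tau_c.
    """
    predicted_noise = [x - guidance * t for x, t in zip(x_t, tau_c)]
    return [x - step * n for x, n in zip(x_t, predicted_noise)]


def generate(tau_c, num_steps=200, guidance=0.5):
    """Run the toy denoising loop from a fixed starting latent."""
    x = [5.0] * DIM
    for _ in range(num_steps):
        x = denoise_step(x, tau_c, guidance=guidance)
    return x
```

In this toy setup the same `tau_c` vector is both the output of the perception step and the conditioning input of every denoising step, which is the structural point the abstract makes. Everything else about the real model may differ.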
| Subjects: | Artificial Intelligence (cs.AI) |
| Cite as: | arXiv:2606.28696 [cs.AI] |
| (or arXiv:2606.28696v1 [cs.AI] for this version) | |
| https://doi.org/10.48550/arXiv.2606.28696 arXiv-issued DOI via DataCite |
Submission history
From: Ziqi Zhou
[v1] Sat, 27 Jun 2026 02:43:13 UTC (20,107 KB)
Originally published at arxiv.org