COMPASS: Grounding Composition-Intent Guidance in Unified Multimodal Models

arXiv cs.AI·Ziqi Zhou, Weize Quan, Mining Tan, Zhihan Chen, Dandan Zheng, Jingdong Chen, Jun Zhou, Weiming Dong, Dong-Ming Yan

6/30/2026

Quick Answer

COMPASS introduces a unified multimodal framework for composition-intent control, enhancing both perception and generation through a shared expert token.

Quick Take

COMPASS introduces a unified multimodal framework for composition-intent control, enhancing both perception and generation through a shared expert token. The authors report substantially better category-level composition understanding and more composition-consistent, prompt-faithful generation than strong baselines. To support training and evaluation, they construct Comp-11, a large-scale dataset with an 11-class taxonomy and reasoning-augmented annotations.

Key Points

COMPASS integrates composition perception and composition-guided generation in a single system.
Uses a shared expert token, τ_c, as the central anchor for composition intent in both perception and generation.
Introduces Comp-11, a large-scale dataset with an 11-class taxonomy and reasoning-augmented annotations.
Reports substantial improvements in category-level composition understanding.
Reports more composition-consistent, prompt-faithful generation than strong baselines.

[Submitted on 27 Jun 2026]

Abstract: Composition is a high-level visual intent that governs where subjects are placed and how a scene is organized, yet current unified multimodal models remain unreliable at fine-grained composition recognition and struggle to turn such intent into controllable generation. We present COMPASS, the first unified multimodal framework that grounds composition-intent control in a single system spanning both composition perception and composition-guided generation, with a shared expert token $\tau_c$ as the central intent anchor. On the perception side, COMPASS injects composition expertise into an MoE backbone in a minimally invasive manner and distills the inferred intent into $\tau_c$. On the generation side, COMPASS reuses $\tau_c$ as a global conditioning signal that steers the denoising trajectory, effectively converting passive composition analysis into explicit layout control. To support systematic instruction-following composition learning and evaluation at scale, we construct Comp-11, a large-scale dataset with an 11-class taxonomy and reasoning-augmented annotations. Extensive experiments show that COMPASS substantially improves category-level composition understanding and delivers more composition-consistent, prompt-faithful generation than strong baselines.
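The abstract does not give implementation details, so the following is only a toy sketch of the general idea of one shared vector being produced on the perception side and reused as a global conditioning signal on the generation side. Every name here (`distill_intent_token`, `conditioned_denoise_step`, `generate`), the gate-weighted mixing, and the simple pull-toward-the-token update are hypothetical illustrations, not the authors' method.

```python
from typing import List

Vector = List[float]


def distill_intent_token(expert_outputs: List[Vector],
                         gate_weights: List[float]) -> Vector:
    """Hypothetical perception side: mix expert outputs into one token.

    Stands in for "distilling the inferred intent into tau_c"; the real
    mechanism is not specified in the abstract.
    """
    total = sum(gate_weights)
    dim = len(expert_outputs[0])
    return [
        sum(w * out[i] for w, out in zip(gate_weights, expert_outputs)) / total
        for i in range(dim)
    ]


def conditioned_denoise_step(x: Vector, tau_c: Vector,
                             step_size: float = 0.5) -> Vector:
    """Hypothetical generation side: one update steered by the shared token.

    A real denoiser would be a learned network; here the token simply
    pulls the sample toward itself to show the conditioning path.
    """
    return [xi + step_size * (ti - xi) for xi, ti in zip(x, tau_c)]


def generate(x0: Vector, tau_c: Vector, num_steps: int = 10) -> Vector:
    """Run the toy update repeatedly, reusing the same tau_c at every step."""
    x = list(x0)
    for _ in range(num_steps):
        x = conditioned_denoise_step(x, tau_c)
    return x


# Two toy "experts" with gate weights 3:1 produce the shared token.
tau_c = distill_intent_token([[1.0, 0.0], [0.0, 1.0]], [3.0, 1.0])
sample = generate([0.0, 0.0], tau_c)
```

The only point the sketch makes is structural: the token is computed once from the perception path and then passed unchanged into every generation step.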

Subjects:	Artificial Intelligence (cs.AI)
Cite as:	arXiv:2606.28696 [cs.AI]
	(or arXiv:2606.28696v1 [cs.AI] for this version)
	https://doi.org/10.48550/arXiv.2606.28696 arXiv-issued DOI via DataCite

Submission history

From: Ziqi Zhou
[v1] Sat, 27 Jun 2026 02:43:13 UTC (20,107 KB)

— Originally published at arxiv.org
