TTE-Flash: Accelerating Reasoning-based Multimodal Representations via Think-Then-Embed Tokens

arXiv cs.AI·Jianpeng Cheng, Xian Wu, Jiangfan Zhang, Wentao Bao, Chaitanya Ahuja, Shlok Kumar Mishra, Hanchao Yu, Yang Gao, Fan Xia, Qi Guo, Shaodan Zhai, Xiangjun Fan, Jun Xiao

1d ago

·~2 min·5/19/2026·en·1

Quick Take

TTE-Flash introduces latent think tokens for efficient reasoning in multimodal representations.

Key Points

Replaces explicit CoT with latent think tokens.
Optimizes tokens using CoT generation and contrastive loss.
Outperforms explicit-CoT models on MMEB-v2 benchmark.

📖 Reader Mode

~2 min read

[Submitted on 15 May 2026]

Authors:Jianpeng Cheng, Xian Wu, Jiangfan Zhang, Wentao Bao, Chaitanya Ahuja, Shlok Kumar Mishra, Hanchao Yu, Yang Gao, Fan Xia, Qi Guo, Shaodan Zhai, Xiangjun Fan, Jun Xiao

View PDF HTML (experimental)

Abstract:Recent research has demonstrated that Universal Multimodal Embedding (UME) benefits significantly from Chain-of-Thought (CoT) reasoning. In this paradigm, a generative model produces explicit reasoning traces for a multimodal query, with the final representation extracted from an <eos> embedding token attending to both the query and the reasoning. Despite its effectiveness, the computational overhead of generating explicit CoT traces is often prohibitive. In this work, we propose replacing explicit CoT with latent think tokens, which are interpreted as latent variables that can produce explicit CoT traces as observed variables. By optimizing think tokens using CoT generation loss and subsequent embedding tokens using contrastive loss, we produce high-performance, reasoning-aware representations at a constant inference cost. Our study investigates two key architectural designs: 1) how think and embeddings tokens should be extracted from the same LLM backbone. 2) how the tokens should be trained as two dependent tasks. We introduce TTE-Flash-2B, a reasoning-aware multimodal representation model that outperforms its explicit-CoT counterpart on the MMEB-v2 benchmark, while producing latent think tokens that are interpretable both textually and visually. Furthermore, zero-shot evaluation across 15 video datasets reveals scaling behavior as the number of think tokens increases, and motivating a pilot study of adaptive think budget allocation based on task requirements.

Subjects:	Artificial Intelligence (cs.AI)
Cite as:	arXiv:2605.16638 [cs.AI]
	(or arXiv:2605.16638v1 [cs.AI] for this version)
	https://doi.org/10.48550/arXiv.2605.16638 arXiv-issued DOI via DataCite (pending registration)

Submission history

From: Xian Wu [view email]
[v1] Fri, 15 May 2026 21:10:56 UTC (42,978 KB)

— Originally published at arxiv.org

Continue reading on arxiv.org

TTE-Flash: Accelerating Reasoning-based Multimodal Representations via Think-Then-Embed Tokens

Quick Take

Key Points

📖 Reader Mode

Submission history

More from arXiv cs.AI

From Prompts to Protocols: An AI Agent for Laboratory Automation

Agentic Trading: When LLM Agents Meet Financial Markets

Invisible Orchestrators Suppress Protective Behavior and Dissociate Power-Holders: Safety Risks in Multi-Agent LLM Systems

Related in this space

Time to REFLECT: Can We Trust LLM Judges for Evidence-based Research Agents?