Mutual Enhancement Between Global Tokens and Patch Tokens: From Theory to Practice
Quick Take
TaTok introduces adaptive image tokenization, enhancing efficiency and accuracy in processing image sequences.
Key Points
- Addresses redundancy and information loss in image tokenization.
- Introduces global tokens that model mutual information across patch tokens, plus a Dynamic Token Filtering (DTF) algorithm that removes redundant patch tokens.
- Reports a 1.3x gFID improvement and an 8.7x inference speedup.
Abstract: Accurate and effective discrete image tokenization is crucial for long image sequence processing. However, current methods rigidly compress all content at a fixed rate, ignoring the variable information density of images and leading to either redundancy or information loss. Inspired by information entropy, we propose TaTok, a Theoretically grounded adaptive image Tokenization framework. We rigorously identify two key drawbacks in existing methods: information insufficiency when reconstructing images with patch tokens alone, and information redundancy among patch tokens. To address these, we introduce global tokens that model mutual information across patch tokens, and a Dynamic Token Filtering (DTF) algorithm based on cumulative conditional entropy to eliminate redundancy. Experiments confirm TaTok's state-of-the-art performance, delivering a 1.3x gFID improvement and 8.7x inference speedup. By allocating tokens according to information richness, TaTok enables more compressed yet accurate image tokenization, offering valuable insights for future research.
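The abstract does not spell out how DTF works, so the following is only a rough illustration of the general idea of filtering by cumulative conditional entropy, not the paper's algorithm. It assumes each patch token already comes with an estimate of its conditional entropy given the earlier tokens, and it keeps the shortest prefix of tokens whose cumulative entropy reaches a chosen share of the total. The function name `tokens_to_keep`, the `coverage` parameter, and the prefix-based rule are all hypothetical.

```python
from itertools import accumulate


def tokens_to_keep(cond_entropies, coverage=0.9):
    """Return how many leading tokens to keep.

    cond_entropies: per-token conditional entropy estimates (hypothetical
        input; the abstract does not say how these are obtained).
    coverage: share of the total entropy the kept prefix must reach
        (hypothetical parameter).
    """
    if not 0.0 < coverage <= 1.0:
        raise ValueError("coverage must be in (0, 1]")

    total = sum(cond_entropies)
    if total <= 0:
        # No measurable information: nothing needs to be kept.
        return 0

    target = coverage * total
    # Walk the running (cumulative) entropy and stop at the first
    # prefix that reaches the target.
    for count, running in enumerate(accumulate(cond_entropies), start=1):
        if running >= target:
            return count
    return len(cond_entropies)


if __name__ == "__main__":
    # Information concentrated in the first token: a short prefix suffices.
    print(tokens_to_keep([4.0, 0.25, 0.125, 0.125]))
    # Information spread evenly: every token is needed.
    print(tokens_to_keep([1.0, 1.0, 1.0, 1.0]))
```

Under these assumptions, an image whose information is concentrated in a few tokens is represented with fewer tokens than one whose information is spread evenly, which matches the abstract's stated goal of allocating tokens according to information richness.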
| Comments: | 21 pages, 8 figures |
| Subjects: | Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI) |
| Cite as: | arXiv:2605.16384 [cs.CV] (or arXiv:2605.16384v1 [cs.CV] for this version) |
| DOI: | https://doi.org/10.48550/arXiv.2605.16384 (arXiv-issued DOI via DataCite, pending registration) |
Submission history
From: Xiusheng Huang
[v1] Mon, 11 May 2026 10:51:02 UTC (1,789 KB)
— Originally published at arxiv.org