Ablate-to-Validate: Are Vision-Language Models Really Using Continuous Thought Tokens?
Quick Take
Under the Ablate-to-Validate principle, an accuracy gain in a vision-language model (VLM) trained with continuous thought tokens does not by itself show that the model actually reasons with those tokens.
Key Points
- Introduces the Token Replacement Test (TRT) for evaluating whether a model uses its latent tokens.
- Finds that accuracy gains alone can be misleading as evidence of reasoning with latent tokens.
- Recommends TRT as a standard diagnostic tool for VLMs.
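The summary does not spell out how TRT is run, so the following is only a minimal sketch of a replacement-style ablation in the spirit of the name, not the paper's procedure. It assumes a hypothetical `predict(question, latents)` interface, uses Gaussian noise as the replacement (an assumption), and relies on two toy predictors invented purely for illustration. The idea it illustrates: if accuracy barely changes when the latent tokens are swapped out, the model is probably not relying on them.

```python
import random
from typing import Callable, List

Latents = List[float]
# Hypothetical interface: a predictor maps (question id, latent tokens) to a label.
Predictor = Callable[[int, Latents], int]


def accuracy(predict: Predictor, questions: List[int],
             latents: List[Latents], labels: List[int]) -> float:
    """Fraction of examples the predictor labels correctly."""
    correct = sum(predict(q, z) == y for q, z, y in zip(questions, latents, labels))
    return correct / len(labels)


def replacement_gap(predict: Predictor, questions: List[int],
                    latents: List[Latents], labels: List[int],
                    seed: int = 0) -> float:
    """Accuracy with the original latents minus accuracy with replaced latents.

    Replacement here is Gaussian noise of the same shape, which is an
    assumption of this sketch rather than a documented TRT setting.
    """
    rng = random.Random(seed)
    replaced = [[rng.gauss(0.0, 1.0) for _ in z] for z in latents]
    original_acc = accuracy(predict, questions, latents, labels)
    replaced_acc = accuracy(predict, questions, replaced, labels)
    return original_acc - replaced_acc


# Toy data: the label is the parity of the question id, and the latent
# tokens encode the label in their sign.
questions = list(range(100))
labels = [q % 2 for q in questions]
latents = [[1.0] * 4 if y else [-1.0] * 4 for y in labels]


def ignores_latents(q: int, z: Latents) -> int:
    """Toy predictor that answers from the question alone."""
    return q % 2


def uses_latents(q: int, z: Latents) -> int:
    """Toy predictor that answers from the latent tokens alone."""
    return int(sum(z) > 0)


gap_ignoring = replacement_gap(ignores_latents, questions, latents, labels)
gap_using = replacement_gap(uses_latents, questions, latents, labels)
```

Both toy predictors are perfectly accurate with the original latents, so accuracy alone cannot tell them apart. Only the gap after replacement separates the predictor that depends on its latent tokens from the one that ignores them.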