Structure over Pixels: Learning Variable-Length Visual Programs
Quick Take
STROP is a discrete visual tokenizer that learns a per-image sequence length and trains against frozen feature targets rather than pixel reconstruction, so its codes emphasize scene structure over texture.
Key Points
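- STROP learns how long each image's token sequence should be, estimating the active prefix length in a single forward pass.
- Training is supervised by frozen DINOv3 features instead of pixel reconstruction, so the codebook emphasizes structure over texture.
- Program length grows with scene complexity, and the learned code vocabulary shows signs of compositional structure.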
Article Content
From source RSS / original summary
arXiv:2605.27696v1 Announce Type: new
Abstract: Discrete visual tokenizers translate images into ordered sequences of codes, providing a natural representation for structural description of scenes. Yet existing adaptive tokenizers either require post-hoc search or select among a discrete set of pre-trained rates, rather than learning a continuous per-image sequence length coupled to the model and scene, and they typically train against pixel reconstruction, emphasizing texture rather than structure.
We propose STROP, a discrete visual tokenizer architecture that forms structural scene representations and simultaneously learns how long an image's visual program should be. Using a four-phase curriculum supervised by local rate-distortion probes against frozen DINOv3 features, STROP optimizes a dedicated length head that estimates the active prefix length in a single forward pass.
By bypassing pixel-level reconstruction gradients, the codebook is shaped entirely by the quality of higher-level latent representations. Program length grows with scene complexity, and signs of compositional structure emerge both in downstream dense-prediction transfer and in direct inspection of the learned code vocabulary.
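The abstract does not spell out how the rate-distortion probes or the length head work, so the following is only a minimal sketch of the general trade-off they allude to, not STROP's actual procedure. It assumes we already have, for one image, a feature-space distortion value for each candidate prefix length, and it picks the length that minimizes distortion plus a per-token rate penalty. The function names, the `rate_weight` parameter, and the distortion numbers are all hypothetical.

```python
from typing import List, Sequence


def choose_prefix_length(distortions: Sequence[float], rate_weight: float) -> int:
    """Pick the prefix length k (1-based) minimizing distortions[k-1] + rate_weight * k.

    distortions[k-1] is an assumed feature-space error when only the first k
    codes are kept; rate_weight is an assumed penalty per kept code.
    """
    if not distortions:
        raise ValueError("need at least one candidate prefix length")
    costs = [d + rate_weight * k for k, d in enumerate(distortions, start=1)]
    # min() returns the first index on ties, so shorter programs win ties.
    return min(range(len(costs)), key=costs.__getitem__) + 1


def active_prefix(codes: Sequence[int], length: int) -> List[int]:
    """Keep only the first `length` codes of an ordered code sequence."""
    return list(codes[:length])


# Hypothetical distortion curves for prefix lengths 1..6 (illustrative only).
simple_scene = [0.50, 0.20, 0.18, 0.17, 0.17, 0.17]
complex_scene = [0.90, 0.70, 0.50, 0.30, 0.15, 0.14]

k_simple = choose_prefix_length(simple_scene, rate_weight=0.05)
k_complex = choose_prefix_length(complex_scene, rate_weight=0.05)
print(k_simple, k_complex)
```

With these made-up curves the simpler scene settles on a shorter prefix than the complex one, which mirrors the abstract's claim that program length grows with scene complexity. A learned length head would predict this length directly rather than searching over candidates as the sketch does.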