MM-Matryoshka: Towards Budget-Elastic Visual Document Retrieval via a 2D Multimodal Matryoshka Training Framework

arXiv cs.CV·Haowen Xiang, Yibo Yan, Jiahao Huo, Yu Huang, Yi Cao, Mingdong Ou, Xuming Hu

6/9/2026

Quick Answer

MM-Matryoshka introduces a budget-elastic 2D training framework for visual document retrieval, allowing flexible multi-vector retrieval without separate models for different budgets.

Quick Take

Compared with direct truncation baselines, MM-Matryoshka retains significantly higher retrieval quality while substantially reducing storage and computational overhead, so one trained retriever can serve several deployment budgets.

Key Points

MM-Matryoshka makes ColPali-style multi-vector retrieval elastic along two axes: vector dimension and encoder layer.
A single retriever selects its 2D budget at inference time, with no separate model trained per budget.
Across multiple representative backbones, it retains significantly higher quality than direct truncation baselines.
Storage and computational overhead are both substantially reduced.
The result is a unified way to trade accuracy for both vector width and encoder depth.
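
To make the idea of a 2D budget concrete, here is a minimal Python sketch of ColPali-style late-interaction (MaxSim) scoring at a chosen (layer, dimension) pair. It is an illustration under stated assumptions, not the paper's implementation: the function names, the `query_by_layer` / `page_by_layer` dictionaries, and the choice to re-normalize after truncation are all hypothetical. A real system would run the encoder only up to the chosen layer rather than look up precomputed per-layer outputs.

```python
from typing import Dict, List, Sequence

Vector = List[float]


def truncate(vectors: Sequence[Vector], dim: int) -> List[Vector]:
    """Keep the first `dim` components of each vector (Matryoshka-style prefix)."""
    return [v[:dim] for v in vectors]


def l2_normalize(v: Vector) -> Vector:
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = sum(x * x for x in v) ** 0.5
    return [x / norm for x in v] if norm > 0 else list(v)


def maxsim_score(query_vecs: Sequence[Vector], page_vecs: Sequence[Vector]) -> float:
    """Late interaction: each query vector takes its best-matching page vector,
    and the per-query maxima are summed."""
    return sum(
        max(sum(q * p for q, p in zip(qv, pv)) for pv in page_vecs)
        for qv in query_vecs
    )


def score_at_budget(
    query_by_layer: Dict[int, List[Vector]],
    page_by_layer: Dict[int, List[Vector]],
    layer: int,
    dim: int,
) -> float:
    """Score a query against a page at one 2D budget (layer, dim).

    Assumption: per-layer embeddings are available in a dict keyed by layer
    index. This stands in for early exit from the encoder.
    """
    q = [l2_normalize(v) for v in truncate(query_by_layer[layer], dim)]
    p = [l2_normalize(v) for v in truncate(page_by_layer[layer], dim)]
    return maxsim_score(q, p)


# Toy data: one layer, two query vectors, two page vectors, full width 4.
toy_query = {2: [[3.0, 4.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]}
toy_page = {2: [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]}

small_budget = score_at_budget(toy_query, toy_page, layer=2, dim=2)
full_budget = score_at_budget(toy_query, toy_page, layer=2, dim=4)
```

The same stored vectors serve both calls; only the prefix length changes. How well the shorter prefix preserves ranking quality is what the training framework is meant to address, and this sketch says nothing about that.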

Article Excerpt

From source RSS / original summary

arXiv:2606.07654v1

Abstract: Multi-vector visual document retrievers achieve strong fine-grained matching by representing each page with multiple vectors from deep Vision-Language Models (VLMs), but this design makes deployment expensive in both storage and computational overhead. Existing efficiency techniques usually optimize only part of this budget, leaving multimodal retrievers without a unified way to trade accuracy for both vector width and encoder depth.

Therefore, we propose MM-Matryoshka, a 2D Matryoshka training framework for budget-elastic Visual Document Retrieval (VDR) that makes ColPali-style multi-vector retrieval elastic along both dimension and layer. At inference time, a single retriever can select a 2D budget without training separate models for different budgets.

Through comprehensive experiments across multiple representative backbones, we demonstrate that MM-Matryoshka retains significantly higher quality than direct truncation baselines while substantially reducing storage and computational overhead, offering robust budget elasticity for efficient VDR.
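
As a rough illustration of why both axes matter for cost, the following Python sketch estimates multi-vector index size and the relative savings at a reduced budget. Every number here is an illustrative assumption, not a figure from the paper: the page count, vectors per page, widths, layer counts, and 2-byte values are hypothetical, and treating encoder compute as proportional to the number of layers run is a simplification.

```python
def index_bytes(num_pages: int, vectors_per_page: int, dim: int,
                bytes_per_value: int = 2) -> int:
    """Raw size of a multi-vector index: pages x vectors x width x bytes."""
    return num_pages * vectors_per_page * dim * bytes_per_value


def relative_cost(dim: int, layers: int, full_dim: int, full_layers: int) -> dict:
    """Cost of a (layers, dim) budget relative to the full model.

    Assumption: storage scales with vector width and encoder compute scales
    with the number of layers executed; other costs are ignored.
    """
    return {
        "storage_ratio": dim / full_dim,
        "encoder_depth_ratio": layers / full_layers,
    }


# Hypothetical deployment: 1M pages, 1024 vectors per page, 2-byte values.
full_index = index_bytes(1_000_000, 1024, 128)   # full width of 128
small_index = index_bytes(1_000_000, 1024, 32)   # truncated to width 32

# Hypothetical budget: width 32 of 128, 12 of 24 encoder layers.
budget = relative_cost(dim=32, layers=12, full_dim=128, full_layers=24)
```

Under these assumed numbers the full index is about 262 GB and the width-32 index about 66 GB, while the reduced depth halves the encoder layers executed per page.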

