Overview of the EReL@MIR 2025 Multimodal Document Retrieval Challenge (Track 1)
Quick Take
The EReL@MIR 2025 Multimodal Document Retrieval Challenge attracted 455 participants and asked each team to build a single retrieval system covering both closed-set document page retrieval and open-domain passage retrieval. The top three teams all built on decoder-based Multimodal-LLM embedders from the Qwen2-VL family, and a training-free system finished within 0.1 point of the fine-tuned winner.
Key Points
- Challenge featured two tasks: MMDocIR and M2KR, with 586 submissions from 22 teams.
- Ranking based on macro-average of mean Recall@{1,3,5} across both tasks.
- Winning systems utilized Qwen2-VL family embedders, differing in ensemble and fusion strategies.
- Training-free system closely competed, finishing within 0.1 point of the fine-tuned winner.
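The ranking rule in the key points can be sketched in a few lines of Python. This is a minimal illustration, not the organisers' scoring script: the function names, the per-query averaging, and the definition of recall as the fraction of relevant items found in the top k are all assumptions, and the toy data is invented.

```python
from statistics import mean


def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant items that appear in the top-k results (assumed definition)."""
    if not relevant_ids:
        raise ValueError("each query needs at least one relevant item")
    top_k = set(ranked_ids[:k])
    return len(top_k & set(relevant_ids)) / len(relevant_ids)


def task_score(runs, ks=(1, 3, 5)):
    """Mean over the cutoffs in ks of the query-averaged Recall@k for one task.

    runs is a list of (ranked_ids, relevant_ids) pairs, one per query.
    """
    per_cutoff = [
        mean(recall_at_k(ranked, relevant, k) for ranked, relevant in runs)
        for k in ks
    ]
    return mean(per_cutoff)


def challenge_score(task_runs, ks=(1, 3, 5)):
    """Macro-average: every task counts equally, however many queries it has."""
    return mean(task_score(runs, ks) for runs in task_runs.values())


# Invented toy data: two MMDocIR-style queries and one M2KR-style query.
toy_runs = {
    "MMDocIR": [
        (["p3", "p1", "p7", "p2", "p9"], {"p1"}),
        (["p4", "p5", "p6", "p8", "p0"], {"p0"}),
    ],
    "M2KR": [
        (["a", "b", "c", "d", "e"], {"a"}),
    ],
}

print(challenge_score(toy_runs))
```

Because the final step averages over tasks rather than over queries, a task with fewer queries carries the same weight as a larger one.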
Article Content
From source RSS / original summary
arXiv:2606.04240v1
Abstract: Retrieval over visually-rich documents, pages that interleave text with figures, tables, and charts, is essential for multimodal retrieval-augmented generation, yet most retrievers still discard the visual channel.
The Multimodal Document Retrieval Challenge, Track 1 of the MIR Challenge at the first EReL@MIR workshop, co-located with The Web Conference 2025, asks participants to build a single retrieval system that handles two complementary regimes: closed-set document page retrieval within long documents from a text query (MMDocIR), and open-domain retrieval of Wikipedia-style passages from an image or image-plus-text query (M2KR).
Systems are ranked by the macro-average of mean Recall@{1,3,5} over the two tasks. The challenge drew 455 entrants and 586 submissions across 22 teams. This report describes the challenge design, datasets, and evaluation protocol; reports the final standings; and analyses the three winning teams' systems.
All three build on decoder-based Multimodal-LLM embedders from the Qwen2-VL family rather than on CLIP-style encoders, and differ chiefly in whether they reach the top through fine-tuned ensembles, training-free multi-route fusion with a strong vision-language re-ranker, or zero-shot late interaction. The training-free system finished within 0.1 point of the fine-tuned winner.
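The abstract does not say how the late-interaction system scores a page. One widely used form is ColBERT-style MaxSim, where each query token vector is matched to its best page vector and the matches are summed. The sketch below assumes that operator, with invented two-dimensional vectors and hypothetical function names; it is not a description of any team's actual system.

```python
def dot(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def maxsim_score(query_vecs, page_vecs):
    """Sum, over query vectors, of the best dot product against any page vector."""
    return sum(max(dot(q, p) for p in page_vecs) for q in query_vecs)


def rank_pages(query_vecs, pages):
    """Return (page_id, score) pairs sorted from best to worst match."""
    scored = [(page_id, maxsim_score(query_vecs, vecs)) for page_id, vecs in pages.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Invented toy embeddings: two query token vectors, two candidate pages.
query = [[1.0, 0.0], [0.0, 1.0]]
pages = {
    "page_a": [[1.0, 0.0], [0.0, 1.0]],  # covers both query directions
    "page_b": [[1.0, 0.0], [0.5, 0.0]],  # covers only the first direction
}

for page_id, score in rank_pages(query, pages):
    print(page_id, score)
```

Unlike a single-vector embedder, this keeps one vector per token or image patch, so a page can score well by matching different parts of the query with different regions.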