Overview of the EReL@MIR 2025 Multimodal Document Retrieval Challenge (Track 1)

6/4/2026

Quick Take

The EReL@MIR 2025 Multimodal Document Retrieval Challenge drew 455 entrants and asked each team to build a single retrieval system covering both closed-set document page retrieval and open-domain passage retrieval. The three winning teams all built on decoder-based Multimodal-LLM embedders from the Qwen2-VL family, and the results were close: a training-free system finished within 0.1 point of the fine-tuned winner.

Key Points

The challenge featured two tasks, MMDocIR and M2KR, and received 586 submissions from 22 teams.
Systems are ranked by the macro-average of mean Recall@{1,3,5} across both tasks.
The three winning systems all use Qwen2-VL family embedders and differ in approach: fine-tuned ensembles, training-free multi-route fusion with a vision-language re-ranker, or zero-shot late interaction.
The training-free system finished within 0.1 point of the fine-tuned winner.

Article Content

From source RSS / original summary

arXiv:2606.04240v1

Abstract: Retrieval over visually-rich documents (pages that interleave text with figures, tables, and charts) is essential for multimodal retrieval-augmented generation, yet most retrievers still discard the visual channel.

The Multimodal Document Retrieval Challenge, Track 1 of the MIR Challenge at the first EReL@MIR workshop, co-located with The Web Conference 2025, asks participants to build a single retrieval system that handles two complementary regimes: closed-set document page retrieval within long documents from a text query (MMDocIR), and open-domain retrieval of Wikipedia-style passages from an image or image-plus-text query (M2KR).

Systems are ranked by the macro-average of mean Recall@{1,3,5} over the two tasks. The challenge drew 455 entrants and 586 submissions across 22 teams. This report describes the challenge design, datasets, and evaluation protocol; reports the final standings; and analyses the three winning teams' systems.
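The ranking metric can be illustrated with a small sketch. It assumes that Recall@k for a query is the fraction of its relevant items found in the top k results, that Recall@k is averaged over queries before being averaged over k, and that the two task scores are then averaged with equal weight. The function names and toy data are hypothetical, and the official scoring script may differ in detail.

```python
from statistics import mean

def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant items that appear in the top-k results."""
    top_k = set(ranked_ids[:k])
    return len(top_k & set(relevant_ids)) / len(relevant_ids)

def task_score(runs, ks=(1, 3, 5)):
    """Mean over k of the query-averaged Recall@k for one task.

    `runs` is a list of (ranked_ids, relevant_ids) pairs, one per query.
    """
    per_k = [mean(recall_at_k(ranked, rel, k) for ranked, rel in runs) for k in ks]
    return mean(per_k)

def challenge_score(mmdocir_runs, m2kr_runs):
    """Macro-average: each task contributes equally, whatever its query count."""
    return mean([task_score(mmdocir_runs), task_score(m2kr_runs)])

# Toy data (hypothetical): two MMDocIR queries and one M2KR query.
mmdocir_runs = [
    (["p3", "p1", "p7", "p2", "p9"], ["p1"]),  # hit at rank 2
    (["p4", "p5", "p6", "p8", "p0"], ["p0"]),  # hit at rank 5
]
m2kr_runs = [
    (["a", "b", "c"], ["a"]),                  # hit at rank 1
]

print(challenge_score(mmdocir_runs, m2kr_runs))
```

Because the average is taken per task first, a task with few queries weighs as much as one with many, which is what distinguishes a macro-average from pooling all queries together.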

All three build on decoder-based Multimodal-LLM embedders from the Qwen2-VL family rather than on CLIP-style encoders, and differ chiefly in whether they reach the top through fine-tuned ensembles, training-free multi-route fusion with a strong vision-language re-ranker, or zero-shot late interaction. The training-free system finished within 0.1 point of the fine-tuned winner.
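For readers unfamiliar with late interaction, the following is a minimal sketch of the ColBERT-style MaxSim scoring rule commonly associated with the term: each query vector is matched to its most similar document vector, and the per-query-vector maxima are summed. The summary does not describe the winning team's exact scoring, so this is only an illustration of the general idea; the function names and the two-dimensional toy vectors are hypothetical.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def maxsim_score(query_vecs, doc_vecs):
    """Late-interaction score: sum over query vectors of the best-matching doc vector."""
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)

# Toy multi-vector embeddings (hypothetical): one vector per token or image patch.
query = [[1.0, 0.0], [0.0, 1.0]]
page_a = [[1.0, 0.0], [0.5, 0.5]]
page_b = [[0.0, 1.0], [0.0, 1.0]]

# Rank candidate pages by descending MaxSim score.
ranked = sorted(
    [("page_a", page_a), ("page_b", page_b)],
    key=lambda item: maxsim_score(query, item[1]),
    reverse=True,
)
print([name for name, _ in ranked])
```

Unlike a single-vector embedder, this keeps one vector per token or patch and defers the query-document comparison to scoring time, at the cost of storing more vectors per page.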
