MUSE: A Unified Agentic Harness for MLLMs
Quick Take
MUSE is a multimodal execution harness that wraps a frozen, off-the-shelf MLLM and improves it without any retraining. Across benchmarks in visual spatial planning, visual perception, multimodal reasoning, and fine-grained visual discrimination, it delivers consistent gains over the bare model, with the largest jumps on challenging instances. The authors' analysis indicates that many MLLM failures stem from harness-level shortcomings rather than fundamental model deficits, and that verifier-guided repair can address them without touching the model.
Key Points
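- MUSE wraps any off-the-shelf MLLM with composable modules for task representation, visual processing, perception tool use, structured parsing, deterministic verification, and verifier-guided repair.
- Consistent gains over the bare model across visual spatial planning, visual perception, multimodal reasoning, and fine-grained visual discrimination benchmarks, with the largest jumps on challenging instances.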
- Many MLLM failures are due to harness-level issues, not model deficits.
- Verifier-guided repair can address these harness-level shortcomings effectively.
- The agentic multimodal harness is presented as a critical yet underexplored design dimension, orthogonal to model-centric optimization.
Article Content
From source RSS / original summary
arXiv:2606.03005v1 (announce type: new)
Abstract: Despite rapid progress, multimodal large language models (MLLMs) still fail on tasks that humans solve effortlessly, such as navigating a grid maze from a screenshot or selecting the correct puzzle piece. Rather than retraining the model, we ask a complementary question: how much capability can be elicited from a frozen MLLM purely by improving the execution scaffold around it?
We introduce MUSE, a multimodal unified structured execution harness that wraps any off-the-shelf MLLM with composable modules for task representation, visual processing, perception tool use, structured parsing, deterministic verification, and verifier-guided repair, without any model retraining. We evaluate MUSE across diverse benchmarks spanning visual spatial planning, visual perception, multimodal reasoning, and fine-grained visual discrimination, using multiple state-of-the-art MLLMs.
MUSE delivers consistent gains over the bare model in all settings, with the largest jumps on challenging instances. Further analysis reveals that many MLLM failures arise from harness-level shortcomings rather than fundamental model deficits, and can be addressed through verifier-guided repair without touching the model. These findings highlight the agentic multimodal harness as a critical yet underexplored design dimension, offering an orthogonal avenue for improving MLLMs beyond model-centric optimization.
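To make the idea of deterministic verification with verifier-guided repair concrete, here is a minimal Python sketch built on the abstract's grid-maze example. It is not the authors' implementation: the grid encoding, the function names (`parse_moves`, `verify`, `solve_with_repair`), the feedback format, and the `stub_model` that stands in for a frozen MLLM are all assumptions made for illustration.

```python
import re
from typing import Callable, List, Optional, Tuple

# Hypothetical maze encoding: '#' wall, '.' free cell, 'S' start, 'G' goal.
Grid = List[str]
MOVES = {"U": (-1, 0), "D": (1, 0), "L": (0, -1), "R": (0, 1)}
Model = Callable[[Grid, Optional[str]], str]


def find(grid: Grid, ch: str) -> Tuple[int, int]:
    """Return the (row, col) of the first cell containing ch."""
    for r, row in enumerate(grid):
        c = row.find(ch)
        if c != -1:
            return r, c
    raise ValueError(f"{ch!r} not found in grid")


def parse_moves(raw: str) -> List[str]:
    """Structured parsing: split free-form model output into move tokens."""
    return [tok for tok in re.split(r"[\s,]+", raw.strip().upper()) if tok]


def verify(grid: Grid, moves: List[str]) -> Optional[str]:
    """Deterministic verifier: None if the path is valid, else an error message."""
    r, c = find(grid, "S")
    for i, m in enumerate(moves):
        if m not in MOVES:
            return f"move {i} ({m!r}) is not one of U/D/L/R"
        dr, dc = MOVES[m]
        r, c = r + dr, c + dc
        if not (0 <= r < len(grid) and 0 <= c < len(grid[r])):
            return f"move {i} ({m}) leaves the grid"
        if grid[r][c] == "#":
            return f"move {i} ({m}) hits a wall at {(r, c)}"
    if grid[r][c] != "G":
        return f"path ends at {(r, c)}, not at the goal"
    return None


def solve_with_repair(grid: Grid, model: Model, max_rounds: int = 3) -> Optional[List[str]]:
    """Query the frozen model, verify its answer, and feed errors back for repair."""
    feedback: Optional[str] = None
    for _ in range(max_rounds):
        moves = parse_moves(model(grid, feedback))
        feedback = verify(grid, moves)
        if feedback is None:
            return moves  # verified solution
    return None  # no verified solution within the round budget


def stub_model(grid: Grid, feedback: Optional[str]) -> str:
    """Stand-in for an MLLM call: wrong on the first try, corrected after feedback."""
    return "R R D" if feedback is None else "D R R"


if __name__ == "__main__":
    maze = ["S.#",
            "..G"]
    print(solve_with_repair(maze, stub_model))  # ['D', 'R', 'R']
```

In this sketch the model's weights are never touched: the first answer walks into a wall, the verifier reports exactly which move failed, and the second answer passes. A real harness would replace `stub_model` with an actual MLLM call and include the verifier's message in the follow-up prompt.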