MindZero: Learning Online Mental Reasoning With Zero Annotations
Quick Take
MindZero is a self-supervised reinforcement learning framework that trains multimodal large language models (MLLMs) for online mental reasoning without ground-truth mental state annotations, strengthening their Theory of Mind (ToM) ability. It outperforms model-based methods in both accuracy and efficiency on mental reasoning and AI assistance tasks in gridworld and household domains, showing that mental reasoning can be learned as a self-supervised skill.
Key Points
- MindZero trains MLLMs for robust online mental reasoning without requiring explicit annotations.
- During training, the model is rewarded for mental state hypotheses that maximize the likelihood of observed actions, as estimated by a planner.
- MindZero significantly improves accuracy and efficiency over traditional model-based methods.
- In evaluations on gridworld and household domains, MindZero outperforms LLM-only and model-based baselines.
- The results indicate that mental reasoning can be learned as a self-supervised skill.
Article Content
From source RSS / original summary: arXiv:2606.00240v1. Abstract: Effective real-world assistance requires AI agents with robust Theory of Mind (ToM): inferring human mental states from their behavior. Despite recent advances, several key challenges remain, including (1) online inference with robust uncertainty updates over multiple hypotheses; (2) efficient reasoning suitable for real-time assistance; and (3) the lack of ground-truth mental state annotations in real-world domains.
We address these challenges by introducing MindZero, a self-supervised reinforcement learning framework that trains multimodal large language models (MLLMs) for efficient and robust online mental reasoning. During training, the model is rewarded for generating mental state hypotheses that maximize the likelihood of observed actions estimated by a planner, similar to model-based ToM reasoning. This method thus eliminates the need for explicit mental state annotations.
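The reward idea described above can be illustrated with a toy example. The abstract does not specify the planner, the exact reward formula, or the RL algorithm, so everything in this sketch is an assumption made for illustration: a hypothetical 2D gridworld, a softmax (Boltzmann) planner that prefers moves reducing Manhattan distance to a goal, and a reward equal to the summed log-likelihood of the observed actions under a hypothesized goal. The names `action_log_prob` and `hypothesis_reward` are invented here and are not taken from MindZero.

```python
import math

# Hypothetical 2D gridworld actions: name -> (dx, dy).
ACTIONS = {"up": (0, 1), "down": (0, -1), "left": (-1, 0), "right": (1, 0)}


def step(state, action):
    """Apply an action to an (x, y) state on an unbounded grid."""
    dx, dy = ACTIONS[action]
    return (state[0] + dx, state[1] + dy)


def manhattan(a, b):
    """Manhattan distance between two grid cells."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def action_log_prob(state, action, goal, beta=2.0):
    """Log-probability of an action under a softmax (Boltzmann) planner.

    Assumption: the agent prefers actions that reduce Manhattan distance
    to its goal; beta controls how close to optimal it acts.
    """
    scores = {a: -beta * manhattan(step(state, a), goal) for a in ACTIONS}
    log_z = math.log(sum(math.exp(s) for s in scores.values()))
    return scores[action] - log_z


def hypothesis_reward(start, actions, goal_hypothesis, beta=2.0):
    """Sum of action log-likelihoods along the observed trajectory."""
    state, total = start, 0.0
    for action in actions:
        total += action_log_prob(state, action, goal_hypothesis, beta)
        state = step(state, action)
    return total


# Score three candidate goal hypotheses against one observed trajectory.
observed = ["right", "right", "up"]
rewards = {
    goal: hypothesis_reward((0, 0), observed, goal)
    for goal in [(3, 2), (-3, 0), (0, -3)]
}
best_goal = max(rewards, key=rewards.get)
```

In this toy setup the hypothesis that best explains the observed moves receives the highest reward, so no ground-truth goal label is needed to score a hypothesis. In a training loop, a score of this kind would be assigned to hypotheses sampled from the model; how MindZero does this in practice is not described in the abstract.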
After training, MindZero internalizes model-based reasoning into fast single-pass inference. We evaluate MindZero against baselines across challenging mental reasoning and AI assistance tasks in gridworld and household domains. We found that LLMs alone are insufficient; model-based methods improve accuracy but are slow, costly, and limited by backbone MLLM capacity.
In contrast, MindZero enhances MLLMs' intrinsic ToM ability and significantly outperforms model-based methods in both accuracy and efficiency, showing that mental reasoning can be effectively learned as a self-supervised skill.