Evaluating Interactive Reasoning in Large Language Models: A Hierarchical Benchmark with Executable Games
Quick Take
A new benchmark evaluates interactive reasoning in large language models (LLMs) through 474 executable games and reveals large differences between models in both success rate and interaction efficiency. Controlled contextual perturbations reduce performance moderately but consistently, while counterfactual revision and necessity judgment cause much larger drops.
Key Points
- The framework evaluates reasoning as active evidence acquisition and belief updating.
- Results indicate large differences in success rates and interaction efficiency among LLMs.
- Contextual perturbations cause moderate declines in performance.
- Counterfactual revision and necessity judgment lead to much larger drops in performance.
- Each of the 474 games is evaluated under five fixed configuration search spaces, one per difficulty level.
Article Excerpt
From the source RSS / original summary (arXiv:2606.00103v1, announce type: new). Abstract: We introduce a multi-turn interactive framework for reasoning evaluation that treats reasoning as active evidence acquisition and belief updating. In this framework, LLMs receive only the task rules, must issue targeted queries to a hidden environment, integrate partial observations over time, and decide when to submit a final answer.
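The interaction protocol described above can be sketched as a small loop. This is a minimal illustration under stated assumptions, not the benchmark's actual code: the `HiddenNumberGame` environment, the `BisectionAgent` (a hand-written stand-in for an LLM policy), and the `run_episode` function with its `max_turns` budget are all hypothetical names invented for this example.

```python
from dataclasses import dataclass


@dataclass
class HiddenNumberGame:
    """Toy hidden environment (hypothetical, not one of the benchmark's games)."""
    secret: int
    low: int = 1
    high: int = 100

    def rules(self) -> str:
        # The agent sees only this rule text, never the secret itself.
        return (f"A secret integer lies in [{self.low}, {self.high}]. "
                "Ask 'is secret <= x?' queries, then submit a final answer.")

    def query(self, x: int) -> bool:
        # A partial observation: one bit of evidence per query.
        return self.secret <= x

    def check(self, answer: int) -> bool:
        return answer == self.secret


@dataclass
class BisectionAgent:
    """Stand-in for an LLM: keeps a belief interval and halves it each turn."""
    low: int
    high: int

    def act(self):
        # Decide when to stop: submit once the belief has a single candidate.
        if self.low == self.high:
            return ("submit", self.low)
        return ("query", (self.low + self.high) // 2)

    def observe(self, x: int, answer: bool) -> None:
        # Belief update from the observation returned by the environment.
        if answer:
            self.high = x
        else:
            self.low = x + 1


def run_episode(game, agent, max_turns=20):
    """Return (success, number_of_queries) for one multi-turn episode."""
    queries = 0
    for _ in range(max_turns):
        kind, value = agent.act()
        if kind == "submit":
            return game.check(value), queries
        agent.observe(value, game.query(value))
        queries += 1
    # Turn budget exhausted without a submission counts as a failure.
    return False, queries
```

In this sketch, an agent that queries more carefully reaches a one-candidate belief in fewer turns, which is the intuition behind measuring interaction efficiency separately from success.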
Beyond standard success rate and interaction efficiency, we evaluate contextual robustness under controlled contextual perturbations, and metacognitive adaptation through counterfactual revision and necessity judgment. We instantiate the framework as a benchmark of 474 executable games, each evaluated under five fixed configuration search spaces corresponding to five difficulty levels, and evaluate a broad set of frontier LLMs.
Results show that the benchmark is highly discriminative, exposing large differences not only in success rate but also in interaction efficiency. Moreover, we empirically show that contextual perturbations cause moderate but consistent declines, whereas counterfactual revision and necessity judgment lead to much larger drops.
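To make the reported quantities concrete, the sketch below aggregates per-episode records into a success rate, an efficiency figure, and a relative drop between a baseline condition and a perturbed one. The excerpt does not give the paper's exact metric definitions, so these formulas, the function names, and the sample records are assumptions chosen for illustration only.

```python
from statistics import mean


def success_rate(episodes):
    """Fraction of episodes solved; each episode is a (success, queries) pair."""
    return sum(1 for ok, _ in episodes if ok) / len(episodes)


def mean_queries_when_solved(episodes):
    """Assumed efficiency proxy: average query count over solved episodes."""
    solved = [q for ok, q in episodes if ok]
    return mean(solved) if solved else None


def relative_drop(baseline_rate, condition_rate):
    """Share of baseline success lost under a perturbed or counterfactual condition."""
    if baseline_rate == 0:
        raise ValueError("baseline success rate must be positive")
    return (baseline_rate - condition_rate) / baseline_rate


# Made-up records for illustration, not results from the paper.
baseline = [(True, 5), (True, 7), (False, 20), (True, 6)]
perturbed = [(True, 6), (False, 20), (False, 20), (True, 8)]

drop = relative_drop(success_rate(baseline), success_rate(perturbed))
```

Reporting the drop relative to the baseline, rather than as an absolute difference, is one way to compare robustness across models whose baseline success rates differ widely.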