Evaluating Interactive Reasoning in Large Language Models: A Hierarchical Benchmark with Executable Games

arXiv cs.AI·Mingyuan Fan, Weiguang Han, Daixin Wang, Cen Chen, Zhiqiang Zhang, Jun Zhou

6/2/2026


Quick Answer

The paper introduces a benchmark of 474 executable games for evaluating interactive reasoning in large language models (LLMs) and finds large differences across models in both success rate and interaction efficiency.

Quick Take

The benchmark treats reasoning as active evidence acquisition: a model sees only the task rules and must query a hidden environment before committing to an answer. Controlled contextual perturbations reduce success rates moderately but consistently, while counterfactual revision and necessity judgment cause much larger drops across the LLMs evaluated.

Key Points

The framework evaluates reasoning as active evidence acquisition and belief updating.
Results show large differences among LLMs in both success rate and interaction efficiency.
Contextual perturbations cause moderate but consistent declines in performance.
Counterfactual revision and necessity judgment lead to much larger drops.
Each of the 474 games is evaluated under five fixed configuration search spaces, one per difficulty level.
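
As a rough illustration of how the two headline metrics and a perturbation-induced decline could be computed, here is a minimal Python sketch. The summary does not give the paper's exact metric definitions, so the `Episode` record, the use of mean query count as an efficiency proxy, and the `relative_drop` formula are all assumptions, and the episode data is invented toy data rather than results from the paper.

```python
from dataclasses import dataclass


@dataclass
class Episode:
    """Hypothetical record of one game played by one model."""
    solved: bool   # whether the submitted final answer was correct
    queries: int   # number of queries issued before submitting


def success_rate(episodes):
    """Fraction of episodes that ended with a correct answer."""
    return sum(e.solved for e in episodes) / len(episodes)


def mean_queries(episodes):
    """Average number of queries per episode (assumed efficiency proxy)."""
    return sum(e.queries for e in episodes) / len(episodes)


def relative_drop(baseline_rate, perturbed_rate):
    """Fractional decline in success rate relative to the baseline."""
    return (baseline_rate - perturbed_rate) / baseline_rate


# Toy data, invented for illustration only.
baseline = [Episode(True, 4), Episode(True, 6), Episode(False, 10), Episode(True, 5)]
perturbed = [Episode(True, 5), Episode(False, 10), Episode(False, 10), Episode(True, 7)]

if __name__ == "__main__":
    base, pert = success_rate(baseline), success_rate(perturbed)
    print(f"baseline success:  {base:.2f}, mean queries: {mean_queries(baseline):.2f}")
    print(f"perturbed success: {pert:.2f}, mean queries: {mean_queries(perturbed):.2f}")
    print(f"relative drop:     {relative_drop(base, pert):.2%}")
```

Reporting the drop relative to the baseline, rather than as an absolute difference, makes declines comparable between models that start from very different success rates; whether the paper reports it this way is not stated in the summary.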

[Submitted on 26 May 2026]

Abstract: We introduce a multi-turn interactive framework for reasoning evaluation that treats reasoning as active evidence acquisition and belief updating. In this framework, LLMs receive only the task rules, must issue targeted queries to a hidden environment, integrate partial observations over time, and decide when to submit a final answer. Beyond standard success rate and interaction efficiency, we evaluate contextual robustness under controlled contextual perturbations, and metacognitive adaptation through counterfactual revision and necessity judgment. We instantiate the framework as a benchmark of 474 executable games, each evaluated under five fixed configuration search spaces corresponding to five difficulty levels, and evaluate a broad set of frontier LLMs. Results show that the benchmark is highly discriminative, exposing large differences not only in success rate but also in interaction efficiency. Moreover, we empirically show that contextual perturbations cause moderate but consistent declines, whereas counterfactual revision and necessity judgment lead to much larger drops.
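
The interaction protocol described in the abstract (rules only, targeted queries, belief updating, a decision about when to submit) can be sketched with a toy game. This is not the authors' implementation and none of these names come from the paper: `HiddenNumberGame`, `BisectionAgent`, and `run_episode` are hypothetical, the game itself is invented for illustration, and a deterministic bisection agent stands in for an LLM.

```python
import random


class HiddenNumberGame:
    """Toy hidden environment: a secret integer in [low, high]."""

    def __init__(self, low, high, seed=None):
        self.low, self.high = low, high
        self._secret = random.Random(seed).randint(low, high)
        self.queries = 0  # counts every query, for interaction efficiency

    def rules(self):
        # The only information the agent receives up front.
        return (f"A secret integer lies in [{self.low}, {self.high}]. "
                "You may ask whether the secret is <= x.")

    def query(self, x):
        """Partial observation: is the secret <= x?"""
        self.queries += 1
        return self._secret <= x

    def submit(self, answer):
        """Final answer; True if correct."""
        return answer == self._secret


class BisectionAgent:
    """Stand-in for an LLM: its belief is an interval it keeps halving."""

    def __init__(self, low, high):
        self.low, self.high = low, high

    def ready(self):
        # Decide when to stop querying: only one candidate remains.
        return self.low == self.high

    def next_query(self):
        return (self.low + self.high) // 2

    def update(self, x, observation):
        # Belief update from one partial observation.
        if observation:
            self.high = x
        else:
            self.low = x + 1


def run_episode(env, agent, max_queries):
    """Query until the agent is ready or the budget runs out, then submit."""
    while not agent.ready() and env.queries < max_queries:
        x = agent.next_query()
        agent.update(x, env.query(x))
    return env.submit(agent.low), env.queries


if __name__ == "__main__":
    # A larger search space plays the role of a harder difficulty level.
    for high in (10, 100, 1000):
        env = HiddenNumberGame(1, high, seed=0)
        solved, used = run_episode(env, BisectionAgent(1, high), max_queries=20)
        print(f"range 1..{high}: solved={solved}, queries={used}")
```

Keeping the query counter inside the environment means success and interaction efficiency are both measured by the harness rather than self-reported by the agent, which is one plausible way to make such games executable and comparable across models.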

Comments:	preprint version, under review
Subjects:	Artificial Intelligence (cs.AI)
Cite as:	arXiv:2606.00103 [cs.AI]
	(or arXiv:2606.00103v1 [cs.AI] for this version)
	https://doi.org/10.48550/arXiv.2606.00103 arXiv-issued DOI via DataCite

Submission history

From: Mingyuan Fan
[v1] Tue, 26 May 2026 09:12:30 UTC (34 KB)

Originally published at arxiv.org
