RealMath-Eval: Why SOTA Judges Struggle with Real Human Reasoning
Quick Take
RealMath-Eval shows that state-of-the-art LLM judges struggle to evaluate real human reasoning, with a Mean Squared Error of ~2.96 against expert grading. The same judges do better on synthetic LLM-generated solutions (MSE ~1.17), revealing an 'Evaluation Gap' between grading synthetic text and grading authentic, more diverse student reasoning.
Key Points
- RealMath-Eval consists of 224 rigorously annotated high school exam responses.
- LLM judges exhibit a high Mean Squared Error of ~2.96 against expert grading.
- On synthetic LLM-generated solutions, the same judges reach a lower MSE of ~1.17.
- Human reasoning transitions are more out-of-distribution for current models.
- Surface-level style transfer does not close the evaluation gap.
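As a rough illustration of the metric behind these numbers, the sketch below computes the Mean Squared Error between a judge's scores and expert scores, then takes the difference between two settings as the gap. The function name, the score values, and the grading scale are assumptions made up for this example; the summary does not state the benchmark's scale or data.

```python
from statistics import fmean


def mean_squared_error(judge_scores, expert_scores):
    """Average squared difference between judge and expert scores."""
    if len(judge_scores) != len(expert_scores):
        raise ValueError("score lists must have the same length")
    if not judge_scores:
        raise ValueError("score lists must not be empty")
    squared_errors = [(j - e) ** 2 for j, e in zip(judge_scores, expert_scores)]
    return fmean(squared_errors)


# Hypothetical scores, for illustration only (not data from RealMath-Eval).
expert_real = [10, 7, 4, 0]
judge_real = [9, 7, 6, 1]
expert_synth = [10, 7, 4, 0]
judge_synth = [10, 7, 5, 0]

mse_real = mean_squared_error(judge_real, expert_real)
mse_synth = mean_squared_error(judge_synth, expert_synth)

# A positive gap means the judge is less accurate on the "real" set.
evaluation_gap = mse_real - mse_synth
print(mse_real, mse_synth, evaluation_gap)
```

Because errors are squared, a single badly misjudged response raises the MSE more than several small disagreements do.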
Article Content
From source RSS / original summary: arXiv:2606.10254v1, Announce Type: new

Abstract: While Large Language Models (LLMs) have achieved near-perfect performance in solving high-school mathematics, their ability to evaluate the diverse reasoning processes of real human students remains under-examined. To bridge this gap, we introduce RealMath-Eval, a rigorously annotated benchmark of 224 real-world exam responses from high schools. Our initial evaluation reveals that even state-of-the-art LLM judges struggle significantly on this task, exhibiting a high Mean Squared Error (~2.96) against expert human grading. To probe a plausible explanation, we contrast this performance with a control setting where the same judges evaluate synthetic LLM-generated solutions. We identify a stark "Evaluation Gap": judges are considerably more accurate and consistent on synthetic text (MSE ~1.17) but struggle to generalize to authentic student reasoning. Through semantic embedding analysis, we find that synthetic errors suffer from a "structural collapse" into predictable, low-dimensional linear subspaces, whereas human errors form a more diverse error space. Furthermore, generative probability probes suggest that human reasoning involves significantly higher information-theoretic surprisal, indicating that student reasoning transitions are more out-of-distribution for current models. Finally, we find that surface-level style transfer fails to close this gap. Our findings suggest that current LLM evaluation pipelines relying heavily on synthetic data may not adequately capture the diversity of authentic student mathematical reasoning.
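For readers unfamiliar with surprisal, here is a minimal sketch of the quantity itself: the surprisal of a token with model probability p is -log2(p) bits, so less expected tokens score higher. This is not the paper's probe, whose details are not given in this summary. The function names and the probability values are hypothetical, and in practice the per-token probabilities would come from a language model.

```python
import math


def token_surprisals(token_probs):
    """Return the surprisal, in bits, of each token probability."""
    for p in token_probs:
        if not 0.0 < p <= 1.0:
            raise ValueError("probabilities must lie in (0, 1]")
    return [-math.log2(p) for p in token_probs]


def mean_surprisal(token_probs):
    """Average surprisal in bits per token over a sequence."""
    if not token_probs:
        raise ValueError("need at least one token probability")
    surprisals = token_surprisals(token_probs)
    return sum(surprisals) / len(surprisals)


# Hypothetical per-token probabilities, for illustration only.
predictable_step = [0.5, 0.5, 0.5]
unexpected_step = [0.5, 0.25, 0.125]

print(mean_surprisal(predictable_step))  # 1.0 bit per token
print(mean_surprisal(unexpected_step))   # 2.0 bits per token
```

Under this reading, a higher average surprisal on student-written steps than on synthetic ones would mean the model finds the student text harder to predict.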