KCSAT-ML: Probing Reasoning Models with Nationwide-Cohort Human Difficulty
Quick Answer
KCSAT-ML introduces a benchmark of 664 math problems from the Korean College Scholastic Ability Test and shows that the items models get wrong often do not match the items human examinees found hard.
Quick Take
KCSAT-ML pairs its benchmark with the Difficulty-aligned Reasoning Gain (DRG) metric, which shows that models with similar accuracy can differ sharply in whether their errors fall on items humans found easy or on items humans found hard. Aggregate accuracy hides this alignment failure, which appears across a wide range of VLMs and LLMs.
Key Points
- KCSAT-ML includes a core set of 339 problems with official error rates.
- At low token budgets, accuracy collapses on the items with the highest human error rates, at every model size.
- Test-time scaling raises token use roughly linearly with cohort error rate, while accuracy gains follow a non-monotonic curve.
- Models with similar accuracy can misalign on easy versus hard items.
- The authors state that the code and dataset builder will be open-sourced at https://github.com/naver-ai/KCSAT-ML.
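The similar-accuracy contrast in the key points can be made concrete with a small sketch. It is not the paper's DRG formula, which this summary does not give. `error_difficulty_gap` is a hypothetical stand-in that compares the mean human error rate of the items a model missed with that of the items it solved, and every number below is invented for illustration.

```python
from statistics import mean


def accuracy(correct):
    """Fraction of items answered correctly."""
    return sum(correct) / len(correct)


def error_difficulty_gap(correct, human_error_rate):
    """Hypothetical proxy, NOT the paper's DRG.

    Mean human error rate on missed items minus mean human error rate
    on solved items. Positive values mean the model's misses sit on
    items humans found hard; negative values mean they sit on items
    humans found easy.
    """
    missed = [h for c, h in zip(correct, human_error_rate) if not c]
    solved = [h for c, h in zip(correct, human_error_rate) if c]
    if not missed or not solved:
        return 0.0  # undefined when the model solves all or none
    return mean(missed) - mean(solved)


# Invented per-item human error rates, ordered easy to hard.
human_error_rate = [0.1, 0.2, 0.3, 0.7, 0.8, 0.9]

# Two invented models with the same accuracy (4 of 6 correct).
model_a = [True, True, True, True, False, False]   # misses the hard items
model_b = [False, False, True, True, True, True]   # misses the easy items

gap_a = error_difficulty_gap(model_a, human_error_rate)
gap_b = error_difficulty_gap(model_b, human_error_rate)

print(accuracy(model_a), accuracy(model_b))
print(gap_a, gap_b)
```

Both models score the same accuracy, yet the proxy takes opposite signs for them, which is the kind of difference an accuracy-only leaderboard cannot show.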
Article Content
From source RSS / original summary
arXiv:2606.10403v1 Announce Type: new
Abstract: Math reasoning benchmarks have proliferated, yet most lack a per-item difficulty signal grounded in actual human performance. We introduce KCSAT-ML, a decade (2014-2025) of Korean College Scholastic Ability Test (KCSAT; Suneung) mathematics: 664 problems with a 339-item core set carrying official per-item error rates from nationwide cohorts of hundreds of thousands of examinees.
We pair the benchmark with Difficulty-aligned Reasoning Gain (DRG): a score-orthogonal metric that asks whether a model's mistakes concentrate on the items humans found hard, or on items humans found easy.
Together they expose, across a wide range of VLMs (and LLMs via OCR), three patterns: (i) low-budget accuracy collapses on the high-human-error tail at every model size; (ii) test-time scaling (TTS) raises token use roughly linearly with cohort error rate, while accuracy gains follow a non-monotonic curve; (iii) within a single family, TTS flips between anti-scaling on the hardest items and overthinking on easier ones -- two faces of the same alignment failure.
On DRG, models with near-identical accuracy can sit at near-opposite values: one model gets wrong what humans also find hard, while another solves the hardest items yet fails on items humans find easy -- a contrast that aggregate accuracy hides. Our code and dataset builder will be open-sourced at https://github.com/naver-ai/KCSAT-ML.
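Pattern (ii) in the abstract, token use rising roughly linearly with cohort error rate, is the kind of claim one could check with an ordinary least-squares fit. The sketch below is an assumption about how such a check might look, not the authors' analysis. `fit_line` is a hypothetical helper, and the token counts are synthetic values chosen to lie on a line.

```python
def fit_line(x, y):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Synthetic data, not results from the paper:
# cohort error rate per difficulty bin and mean tokens a model spent there.
cohort_error_rate = [0.1, 0.3, 0.5, 0.7, 0.9]
mean_tokens = [1200, 2000, 2800, 3600, 4400]

slope, intercept = fit_line(cohort_error_rate, mean_tokens)
print(f"extra tokens per unit of error rate: {slope:.1f}")
print(f"tokens at zero error rate: {intercept:.1f}")
```

A real analysis would also plot accuracy gain against the same error-rate bins, since the abstract reports that gains are non-monotonic even while token use keeps climbing.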