LLM Performance on a Real, Double-Marked GCSE Benchmark
Quick Answer
Using a new dataset of 32,534 double-marked GCSE responses, this paper shows that the top large language models (LLMs) agree with examiners more closely than the two examiners agree with each other.
Quick Take
The dataset covers 328 questions across five subjects and includes handwritten work. Models score well both on subjective tasks such as English essay marking and on complex, messy handwritten Maths scripts. The authors present this as evidence that LLMs can provide cost-effective automated marking for educational assessments.
Key Points
- Dataset includes 32,534 double-marked responses across 328 questions and five subjects.
- Top LLMs agree with examiners better than examiners agree with each other.
- Models score highly on subjective tasks like English essay marking and can handle complex, messy handwritten Maths scripts.
- Agreement sits close to the examiner-to-examiner level across subjects and does not depend strongly on model size.
- LLMs offer a cost-effective solution for automated marking in education.
Article Excerpt
From source RSS / original summary (arXiv:2606.24973v1). Abstract: We introduce a dataset of 32,534 double-marked real student responses to GCSE mock exams (GCSEs are the UK's national exams, taken at age ~16), spanning 328 questions across five subjects and including handwritten work. We test whether off-the-shelf large language models agree with examiners as closely as the two examiners agree with each other.
We find that models overwhelmingly agree well with the examiner consensus across subjects, with the top performing models agreeing more closely with examiners than examiners agree with each other. Models achieve high scores for subjective tasks like English essay marking, as well as handling complex and messy handwritten Maths paper scripts. Agreement is uniform near the examiner line, and not massively discriminated by model size, providing cost-effective automated marking solutions.
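The comparison described in the abstract, model-to-examiner agreement measured against examiner-to-examiner agreement, can be sketched as below. This is a minimal illustration, not the paper's method: the excerpt does not state which agreement metric the authors use, so mean absolute mark difference is an assumption here, and the function names and toy marks are hypothetical.

```python
from statistics import mean


def mean_abs_diff(marks_a, marks_b):
    """Mean absolute difference between two equal-length lists of marks."""
    if len(marks_a) != len(marks_b):
        raise ValueError("mark lists must have the same length")
    return mean(abs(a - b) for a, b in zip(marks_a, marks_b))


def compare_agreement(examiner_1, examiner_2, model):
    """Compare the examiner-to-examiner gap with the model-to-examiner gap.

    Lower values mean closer agreement. The metric (mean absolute
    difference) is an assumption, not taken from the paper.
    """
    human_gap = mean_abs_diff(examiner_1, examiner_2)
    # Average the model's gap to each examiner separately, so the model
    # is judged against the same kind of single-marker score.
    model_gap = mean(
        [mean_abs_diff(model, examiner_1), mean_abs_diff(model, examiner_2)]
    )
    return {
        "examiner_vs_examiner": human_gap,
        "model_vs_examiners": model_gap,
        "model_at_or_below_examiner_line": model_gap <= human_gap,
    }


# Hypothetical toy marks for four responses (not from the dataset).
examiner_1 = [3, 5, 2, 4]
examiner_2 = [4, 5, 1, 4]
model = [3, 5, 2, 4]

result = compare_agreement(examiner_1, examiner_2, model)
print(result)
```

In this toy case the model's average gap to the examiners (0.25 marks) is smaller than the gap between the two examiners (0.5 marks), which is the pattern the abstract reports for the top-performing models. A real analysis would likely normalise by each question's maximum mark and might use a chance-corrected statistic instead.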