LLM Performance on a Real, Double-Marked GCSE Benchmark

arXiv cs.CL·Malachy Fox, Kavi Samra, Paul Jung

6/25/2026 · ~1 min read · en

Quick Answer

A new dataset of 32,534 double-marked GCSE responses shows that the top-performing large language models (LLMs) agree with examiners more closely than the examiners agree with each other, including on subjective tasks such as English essay marking and on messy handwritten Maths scripts.

Quick Take

Across 328 questions in five subjects, off-the-shelf LLMs mark real student responses at close to the level of agreement between two human examiners, and the best models exceed it. Because agreement depends little on model size, smaller and cheaper models may be enough for cost-effective automated marking.

Key Points

Dataset includes 32,534 double-marked responses across 328 questions and five subjects.
Top LLMs agree with examiners better than examiners agree with each other.
Models score highly on subjective tasks like English essay marking and on complex, messy handwritten Maths scripts.
Agreement sits close to the examiner-to-examiner level across models and depends only weakly on model size.
LLMs offer a cost-effective solution for automated marking in education.
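
To make the "examiner line" comparison concrete, here is a minimal sketch of how model-to-examiner agreement could be set against examiner-to-examiner agreement. The summary does not say which agreement metric the paper uses, so mean absolute difference in marks is an assumption here, and the function names and the marks themselves are hypothetical illustrations rather than data from the paper.

```python
from statistics import mean


def mean_abs_diff(a, b):
    """Mean absolute difference between two equal-length lists of marks."""
    if len(a) != len(b):
        raise ValueError("mark lists must have the same length")
    return mean(abs(x - y) for x, y in zip(a, b))


def compare_to_examiner_line(examiner_1, examiner_2, model):
    """Compare the model's gap to examiners with the gap between examiners.

    Assumption: agreement is measured as mean absolute difference in marks;
    the paper may use a different statistic.
    """
    # Baseline ("examiner line"): how far apart the two human markers are.
    human_gap = mean_abs_diff(examiner_1, examiner_2)
    # Model gap: average of the model's distance to each examiner in turn,
    # so the model is judged against single markers, as the humans are.
    model_gap = mean(
        [mean_abs_diff(model, examiner_1), mean_abs_diff(model, examiner_2)]
    )
    return {
        "human_gap": human_gap,
        "model_gap": model_gap,
        "model_within_examiner_line": model_gap <= human_gap,
    }


# Made-up marks for four responses (not taken from the dataset).
examiner_1 = [3, 5, 2, 4]
examiner_2 = [4, 5, 1, 4]
model = [4, 5, 2, 4]

result = compare_to_examiner_line(examiner_1, examiner_2, model)
print(result)
```

In this toy case the model's average gap to each examiner is smaller than the gap between the two examiners, which is the pattern the key points describe for the top models. Comparing the model against each examiner separately, rather than against their averaged mark, is a design choice made here to keep the two gaps comparable.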


Article Excerpt

From source RSS / original summary

arXiv:2606.24973v1 (announce type: new). Abstract: We introduce a dataset of 32,534 double-marked real student responses to GCSE mock exams (GCSEs are the UK's national exams, taken at age ~16), spanning 328 questions across five subjects and including handwritten work. We test whether off-the-shelf large language models agree with examiners as closely as the two examiners agree with each other.

We find that models overwhelmingly agree well with the examiner consensus across subjects, with the top performing models agreeing more closely with examiners than examiners agree with each other. Models achieve high scores for subjective tasks like English essay marking, as well as handling complex and messy handwritten Maths paper scripts. Agreement is uniform near the examiner line, and not massively discriminated by model size, providing cost-effective automated marking solutions.
