Towards Fully Automated Exam Grading: Fairness-Aware Recognition of Handwritten Answers with Foundation Models
Quick Answer
This paper shows that vision-language foundation models can read handwritten exam answers with 98.4% accuracy, and that supplying the reference solution as context lowers the false-negative rate to 0.58%.
Quick Take
Earlier automated approaches failed on the answers that matter most: those placed outside the cell, crossed out, or written in cursive. Vision-language foundation models interpret the page instead of matching pixel templates, which handles these cases and makes large-scale, unsupervised grading viable.
Key Points
- Vision-language models close the recognition gap in grading handwritten exams.
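- Recognition accuracy rose from about 88%-91% in earlier approaches to 98.4% on a benchmark of 61 exams (3141 answer positions).
- A lightweight prompt that supplies the reference solution as context lowers the false-negative rate to 0.58%.
- Under an exemplary grading scheme, only three of the 61 exams would be graded worse, and all three are caught by a student self-review step.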
- Anonymized benchmark released for reproducibility and further research.
Article Content
Source: arXiv:2606.11477v1 (announce type: new)
Abstract: Correcting handwritten exams by hand is time-consuming and error-prone, particularly for large cohorts, while fully digital exams tend to force a didactic narrowing towards closed question formats. A practical middle ground keeps paper-based, problem-oriented tasks but records the assessment-relevant answers as single capital letters in a table that a machine can read.
The open question is whether this reading can be made accurate and, above all, fair enough for unsupervised grading. Earlier automated approaches reached only about 88%-91% recognition, which is too low, and failed on the cases that matter most: answers placed outside the cell, crossed out, or written in cursive. We show that general-purpose vision-language foundation models (VLMs), which interpret the page rather than match pixel templates, close this gap.
On a benchmark of 61 anonymised exams (3141 answer positions) the best model reaches 98.4% accuracy, well above the previous baseline. Crucially, we centre the evaluation on fairness: we distinguish false negatives (a correct answer marked wrong, which disadvantages the student) from false positives, and a lightweight prompt that supplies the reference solution as context lowers the false-negative rate to 0.58%.
Under an exemplary grading scheme only three of the 61 exams would be graded worse, all caught by a student self-review step. Fully automated, fairness-aware exam grading at scale is therefore defensible; we release the anonymised benchmark to support reproducibility.
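The distinction the abstract draws between false negatives and false positives can be made concrete with a small sketch. This is an illustration, not the authors' evaluation code: the `Position` record, the `fairness_metrics` function, and the sample data are hypothetical, and the choice to divide both error counts by the total number of answer positions is an assumption, since the abstract does not state the denominator.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Position:
    """One answer cell (hypothetical record layout)."""
    written: str     # letter the student actually wrote (human annotation)
    recognized: str  # letter the model read from the page
    solution: str    # letter in the reference solution


def fairness_metrics(positions):
    """Return recognition accuracy plus false-negative and false-positive rates.

    Assumption: both rates are taken over all answer positions.
    """
    n = len(positions)
    if n == 0:
        raise ValueError("no answer positions given")

    # Recognition accuracy: the model read exactly what was written.
    correct_reads = sum(p.recognized == p.written for p in positions)

    # False negative: the student was right, but the misread letter is graded wrong.
    false_neg = sum(
        p.written == p.solution and p.recognized != p.solution for p in positions
    )

    # False positive: the student was wrong, but the misread letter is graded right.
    false_pos = sum(
        p.written != p.solution and p.recognized == p.solution for p in positions
    )

    return {
        "accuracy": correct_reads / n,
        "false_negative_rate": false_neg / n,
        "false_positive_rate": false_pos / n,
    }


# Made-up sample data, for illustration only.
sample = [
    Position(written="A", recognized="A", solution="A"),  # read correctly
    Position(written="B", recognized="D", solution="B"),  # false negative
    Position(written="C", recognized="B", solution="B"),  # false positive
    Position(written="D", recognized="C", solution="A"),  # misread, wrong either way
]
metrics = fairness_metrics(sample)
print(metrics)
```

The last sample row shows why accuracy alone does not capture fairness: a misread that turns one wrong letter into another wrong letter lowers accuracy but leaves the grade unchanged, while a false negative costs the student points.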