Towards Fully Automated Exam Grading: Fairness-Aware Recognition of Handwritten Answers with Foundation Models

arXiv cs.CV·Hartwig Grabowski

6/11/2026 · ~2 min · en

Quick Answer

This paper shows that a new approach using vision-language foundation models achieves 98.4% accuracy in automated grading of handwritten exams and improves fairness by lowering the false-negative rate to 0.58%.

Quick Take

A new approach using vision-language foundation models achieves 98.4% accuracy in automated grading of handwritten exams, significantly improving fairness by reducing false negatives to 0.58%. This method addresses previous limitations in recognizing diverse answer formats, making it viable for large-scale, unsupervised grading.

Key Points

Vision-language models close the recognition gap in grading handwritten exams.
Accuracy improved from 88%-91% to 98.4% on a benchmark of 61 exams.
False-negative rate reduced to 0.58% with context-aware prompting.
Under an exemplary grading scheme, only three of the 61 exams would be graded worse, and all three were caught by a student self-review step.
Anonymized benchmark released for reproducibility and further research.

Article Content

From source RSS / original summary

arXiv:2606.11477v1. Abstract: Correcting handwritten exams by hand is time-consuming and error-prone, particularly for large cohorts, while fully digital exams tend to force a didactic narrowing towards closed question formats. A practical middle ground keeps paper-based, problem-oriented tasks but records the assessment-relevant answers as single capital letters in a table that a machine can read.

The open question is whether this reading can be made accurate and, above all, fair enough for unsupervised grading. Earlier automated approaches reached only about 88% to 91% recognition, which is too low, and failed on the cases that matter most: answers placed outside the cell, crossed out, or written in cursive. We show that general-purpose vision-language foundation models (VLMs), which interpret the page rather than match pixel templates, close this gap.

On a benchmark of 61 anonymised exams (3141 answer positions) the best model reaches 98.4% accuracy, well above the previous baseline. Crucially, we centre the evaluation on fairness: we distinguish false negatives (a correct answer marked wrong, which disadvantages the student) from false positives, and a lightweight prompt that supplies the reference solution as context lowers the false-negative rate to 0.58%.
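The distinction between false negatives and false positives can be made concrete with a small sketch. The names `Position` and `fairness_report` are hypothetical, and the code assumes each rate is taken over all answer positions; the abstract does not state the denominator, so this is an illustration of the definitions rather than the paper's evaluation code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Position:
    """One answer cell (hypothetical structure, not from the paper)."""
    written: str     # letter the student actually wrote (ground truth)
    recognised: str  # letter the model read from the scan
    solution: str    # reference solution for this cell


def fairness_report(positions):
    """Return recognition accuracy plus false-negative and false-positive rates.

    Assumption: all three values are fractions of *all* answer positions.
    """
    n = len(positions)
    if n == 0:
        raise ValueError("need at least one answer position")

    # Recognition accuracy: the model read exactly what was written.
    correct_reads = sum(p.written == p.recognised for p in positions)
    # False negative: the student was right, but the misread marks it wrong.
    false_neg = sum(
        p.written == p.solution and p.recognised != p.solution for p in positions
    )
    # False positive: the student was wrong, but the misread marks it right.
    false_pos = sum(
        p.written != p.solution and p.recognised == p.solution for p in positions
    )
    return {
        "accuracy": correct_reads / n,
        "false_negative_rate": false_neg / n,
        "false_positive_rate": false_pos / n,
    }


sample = [
    Position("A", "A", "A"),  # read correctly, answer correct
    Position("B", "D", "B"),  # misread: correct answer marked wrong (FN)
    Position("C", "A", "A"),  # misread: wrong answer marked right (FP)
    Position("D", "D", "B"),  # read correctly, answer wrong
]
report = fairness_report(sample)
```

If the reported 0.58% were taken over all 3141 positions (again an assumption), it would correspond to roughly 18 cells in which a correct answer was marked wrong.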

Under an exemplary grading scheme only three of the 61 exams would be graded worse, all caught by a student self-review step. Fully automated, fairness-aware exam grading at scale is therefore defensible; we release the anonymised benchmark to support reproducibility.
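Whether a misread actually harms a student depends on the grading scheme, since one lost point does not always cross a grade boundary. The following sketch counts exams that would be graded worse under automated reading. The thresholds and the helper names `score`, `grade`, and `exams_graded_worse` are hypothetical; the abstract does not specify the exemplary scheme it uses.

```python
def score(answers, solution):
    """Number of cells whose letter matches the reference solution."""
    return sum(a == s for a, s in zip(answers, solution))


def grade(points, max_points, thresholds=(0.5, 0.65, 0.8, 0.9)):
    """Map points to a grade level from 0 (fail) to 4 (best).

    The thresholds are illustrative assumptions, not the paper's scheme.
    """
    fraction = points / max_points
    return sum(fraction >= t for t in thresholds)


def exams_graded_worse(exams, solution):
    """Return ids of exams whose automated grade is below the true grade.

    `exams` maps an exam id to a (written, recognised) pair of answer strings.
    """
    worse = []
    for exam_id, (written, recognised) in exams.items():
        true_grade = grade(score(written, solution), len(solution))
        auto_grade = grade(score(recognised, solution), len(solution))
        if auto_grade < true_grade:
            worse.append(exam_id)
    return worse


solution = "ABCDABCDAB"
exams = {
    "e1": ("ABCDABCDAB", "ABCDABCDAD"),  # one misread, grade unchanged
    "e2": ("ABCDABCDCC", "ABCDABCCCC"),  # one misread, drops a grade level
    "e3": ("AAAAAAAAAA", "AAAAAAAAAA"),  # read correctly
}
affected = exams_graded_worse(exams, solution)
```

In this toy data only `e2` is affected: `e1` also loses a point to a misread, but stays within the same grade band.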
