SEFORA: Student Essays with Feedback Corpus and LLM Feedback Evaluation Framework

arXiv cs.CL·Shayan Peyghambari Oskoui, Norah Almousa, Zhaoyi Joey Hou, Carolina Gustafson, Gayle Rogers, Raquel Coelho, Diane Litman, Xiang Lorraine Li

7/2/2026

Quick Answer

SEFORA is a public corpus of 564 essay drafts with 8,240 instructor annotations, released alongside UniMatch, a framework for evaluating LLM-generated writing feedback.

Quick Take

SEFORA is a public corpus of 564 drafts and 8,240 instructor annotations collected from real college writing classrooms. The UniMatch framework evaluates LLM-generated feedback against instructor feedback. Across 74 experimental configurations, no setting exceeds 0.4 F1, which indicates that current models struggle to align their feedback with instructor priorities.

Key Points

SEFORA corpus captures real instructor feedback across various college writing genres.
UniMatch segments feedback into units, scores their semantic correspondence under instructor-derived criteria, and aligns them via optimal matching to yield precision, recall, and F1.
No LLM configuration achieved an F1 score exceeding 0.4 in the evaluation.
Models struggle to prioritize feedback that instructors deem important.
Performance decreases as models generate more feedback units.

Article Content

From source RSS / original summary

arXiv:2607.00274v1

Abstract: Effective writing feedback is among the strongest drivers of student learning, yet producing it at scale is labor-intensive. LLMs offer a natural path to scaling writing support, but two gaps stand in the way: few public corpora capture how instructors actually deliver feedback in real classrooms, and no reliable method measures whether generated feedback aligns with what an instructor would write. We address both.

SEFORA is a public corpus pairing instructor inline feedback with assignment prompts, rubrics, scores, and multi-draft revisions across various college writing genres, comprising 564 drafts and 8,240 instructor annotations. UniMatch is a reference-based evaluation framework for open-ended generation: it segments feedback into feedback units, scores their semantic correspondence under instructor-derived criteria, and aligns them via optimal matching to yield interpretable precision, recall, and F1.
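
The segment, score, and match pipeline described above can be illustrated with a minimal sketch. This is not the UniMatch implementation: the token-overlap scorer `jaccard`, the 0.5 `threshold`, the brute-force search, and all function names are assumptions made for illustration, standing in for the instructor-derived semantic scoring and optimal matching that the abstract describes.

```python
from itertools import permutations

def jaccard(a: str, b: str) -> float:
    """Toy stand-in for a semantic scorer: token-set overlap in [0, 1]."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0

def best_matching(sim):
    """Brute-force one-to-one alignment that maximizes total similarity.

    sim[g][r] is the score between generated unit g and reference unit r.
    Only suitable for a handful of units; hypothetical helper.
    """
    n_gen = len(sim)
    n_ref = len(sim[0]) if sim else 0
    if n_gen <= n_ref:
        candidates = (list(zip(range(n_gen), perm))
                      for perm in permutations(range(n_ref), n_gen))
    else:
        candidates = (list(zip(perm, range(n_ref)))
                      for perm in permutations(range(n_gen), n_ref))
    return max(candidates, key=lambda pairs: sum(sim[g][r] for g, r in pairs))

def precision_recall_f1(generated, reference, threshold=0.5, score=jaccard):
    """Count matched pairs at or above the (assumed) threshold as hits."""
    if not generated or not reference:
        return 0.0, 0.0, 0.0
    sim = [[score(g, r) for r in reference] for g in generated]
    hits = sum(1 for g, r in best_matching(sim) if sim[g][r] >= threshold)
    precision = hits / len(generated)
    recall = hits / len(reference)
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1

# Invented feedback units, already segmented by hand.
generated = [
    "thesis statement is unclear",
    "add a citation for this claim",
    "great job overall",
]
reference = [
    "your thesis statement is unclear",
    "this claim needs a citation",
]
p, r, f1 = precision_recall_f1(generated, reference)
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

In this toy case two of the three generated units align with instructor units, so recall is perfect while precision is penalized by the unmatched generic praise. For more than a few units, a polynomial-time assignment solver such as `scipy.optimize.linear_sum_assignment` would be a more practical choice than enumerating permutations.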

Across 74 experimental configurations spanning multiple LLMs, no setting exceeds 0.4 F1. UniMatch reveals that models struggle to identify the feedback instructors would prioritize, and performance degrades as models generate more.
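
One arithmetic reason that generating more can hurt a matching-based F1 is sketched below. The counts are hypothetical and are not taken from the paper, and the function name `prf` is an assumption: if the number of generated units that match instructor units stays fixed while the total number of generated units grows, precision falls and pulls F1 down with it.

```python
def prf(matched: int, n_generated: int, n_reference: int):
    """Precision, recall, and F1 from raw counts (hypothetical helper)."""
    precision = matched / n_generated if n_generated else 0.0
    recall = matched / n_reference if n_reference else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

# Hypothetical scenario: 10 instructor units, 4 of which the model recovers,
# while the model emits more and more feedback units overall.
for n_generated in (5, 10, 20, 40):
    p, r, f1 = prf(matched=4, n_generated=n_generated, n_reference=10)
    print(f"generated={n_generated:2d} precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

Recall stays at 0.4 throughout, so the drop in F1 here comes entirely from precision. Whether this is the mechanism behind the reported degradation is not stated in the abstract.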
