The Order Matters: Sequential Fine-Tuning of LLaMA for Coherent Automated Essay Scoring
Quick Answer
This paper shows that sequential fine-tuning of LLaMA-3.1-8B for Automated Essay Scoring (AES) gives the strongest overall results among three training curricula, with F1 scores of 65% for evidence and 87% for conclusion.
Quick Take
Sequential fine-tuning of LLaMA-3.1-8B with LoRA and 4-bit quantization gives the strongest overall AES results on the PERSUADE 2.0 corpus, with F1 scores of 65% for evidence and 87% for conclusion. The result highlights the importance of task dependencies for coherence and generalization, and shows that a small, task-optimized model can outperform a general-purpose LLaMA-70B baseline on conclusion scoring. The authors release templates and implementation details for future work on curriculum design in educational NLP.
Key Points
- Sequential fine-tuning yields the highest F1 scores: 65% for evidence, 87% for conclusions.
- Independent (task-specific) models scored below Sequential fine-tuning.
- Randomized training improved position scoring (57% F1) but was less consistent elsewhere.
- Curriculum design aligned with discourse structure can materially improve Automated Essay Scoring systems.
- Small, optimized models can effectively compete with larger language models.
Article Content
From source RSS / original summary
arXiv:2606.10327v1 Announce Type: new
Abstract: Automated Essay Scoring (AES) systems must judge interdependent discourse elements (e.g., lead, claim, evidence, conclusion), yet most approaches treat these in isolation, harming coherence and generalization. We investigate task-aware fine-tuning of LLaMA-3.1-8B for AES using parameter-efficient LoRA with 4-bit quantization and compare three training curricula: (i) Sequential (progressively fine-tuning on lead, then position, then claim, then evidence, then conclusion), (ii) Independent (task-specific models), and (iii) Randomized (shuffled multi-task). Experiments on the PERSUADE 2.0 corpus show that modeling task dependencies matters: Sequential fine-tuning yields the strongest overall results, including F1 scores of 65% (evidence) and 87% (conclusion) and corresponding accuracies of 63% and 85%, surpassing Independent training and outperforming a general-purpose LLaMA-70B baseline on conclusion despite its far larger capacity. Randomized training improves position scoring (57% F1) but is less consistent elsewhere. These findings indicate that (1) curriculum design aligned with discourse structure can materially improve AES, and (2) small, task-optimized models can be competitive with substantially larger Large Language Models (LLMs), offering a practical path to scalable, cost-effective assessment. We release templates and implementation details to facilitate reproduction and future work on curriculum design for educational NLP.
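The three curricula in the abstract differ only in how training examples are ordered and how many models are trained. The sketch below illustrates that difference with plain Python lists and no actual model. It is not the authors' implementation: the function names, the toy example IDs, and the seed are all hypothetical, and the LoRA fine-tuning step itself is left out.

```python
import random
from typing import Dict, List, Tuple

# Task order as listed in the abstract for the Sequential curriculum.
TASKS = ["lead", "position", "claim", "evidence", "conclusion"]

# (task name, example id); the example ids are hypothetical placeholders.
Example = Tuple[str, str]


def sequential_schedule(data: Dict[str, List[str]]) -> List[List[Example]]:
    """One model, one stage per task, stages run in the fixed task order.

    In a real run, each stage would continue from the weights produced
    by the previous stage (assumption: that step is omitted here).
    """
    return [[(task, ex) for ex in data[task]] for task in TASKS]


def independent_schedule(data: Dict[str, List[str]]) -> Dict[str, List[Example]]:
    """One separate model per task, each seeing only its own examples."""
    return {task: [(task, ex) for ex in data[task]] for task in TASKS}


def randomized_schedule(data: Dict[str, List[str]], seed: int = 0) -> List[Example]:
    """One model trained on all tasks pooled together and shuffled."""
    pooled = [(task, ex) for task in TASKS for ex in data[task]]
    random.Random(seed).shuffle(pooled)
    return pooled


# Toy data: two placeholder examples per task.
toy = {task: [f"{task}-{i}" for i in range(2)] for task in TASKS}

for stage in sequential_schedule(toy):
    print("sequential stage:", stage[0][0], "->", len(stage), "examples")
print("independent models:", list(independent_schedule(toy)))
print("randomized stream length:", len(randomized_schedule(toy)))
```

The Sequential schedule is the only one in which a single model meets the tasks in discourse order, which is the property the abstract credits for the stronger evidence and conclusion scores.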