The Order Matters: Sequential Fine-Tuning of LLaMA for Coherent Automated Essay Scoring

arXiv cs.CL·Ali Keramati, Mark Warschauer

6/10/2026

·~2 min·6/10/2026·en·1

Quick Answer

This paper shows that Sequential fine-tuning of LLaMA-3.1-8B for Automated Essay Scoring (AES) significantly outperforms independent and randomized approaches, achieving F1 scores of 65% for evidence and 87% for conclusions.

Quick Take

This method highlights the importance of task dependencies in enhancing coherence and generalization, demonstrating that smaller models can compete with larger ones like LLaMA-70B. The study provides templates for future educational NLP research.

Key Points

Sequential fine-tuning yields the highest F1 scores: 65% for evidence, 87% for conclusions.
Independent models performed worse than sequential tuning in coherence and generalization.
Randomized training improved position scoring but lacked consistency in other areas.
Task-aware curriculum design can enhance Automated Essay Scoring systems significantly.
Small, optimized models can effectively compete with larger language models.

Paper Resources

Read Paperarxiv.org View PDFarxiv.org

Source Excerpt

arXiv:2606. 10327v1 Announce Type: new Abstract: Automated Essay Scoring (AES) systems must judge interdependent discourse elements (e. g. , lead, claim, evidence, conclusion), yet most approaches treat these in isolation, harming coherence and generalization. We investigate task-aware fine-tuning of LLaMA-3.

1-8B for AES using parameter-efficient LoRA with 4-bit quantization and compare three training curricula: (i) Sequential (progressively fine-tuning on lead, then position, then claim, then evidence, then conclusion), (ii) Independent (task-specific models), and (iii) Randomized (shuffled multi-task). …

Read on arxiv.org

Want this in your inbox every morning?

Daily brief at your local 8am — bilingual EN/中文, free.

Subscribe — it's free

More from arXiv cs.CL

See more →

arXiv cs.CL·Isabel Xu (The Overlake School), Cynthia Xu (The Overlake School), Rachel Ren (Edwards Vacuum Inc.), Cong Guo (The University of Memphis), Jiacheng Ding (The University of Memphis)

5d ago

FeaturedOriginal

TriAgent: Divergence-Aware Committees for Cost-Efficient Financial Sentiment Analysis

AI Summary

TriAgent introduces a cost-efficient multi-agent system for financial sentiment analysis, combining VADER, FinBERT, and Qwen2.5. It achieves an F1 score of ~0.87 with significant savings of $9.3M/year at a 10M-user scale compared to GPT-4o-mini, while also detecting hallucinations with an AUC of 0.90.

#LLM #Agent #AI Startup #Enterprise AI

The Order Matters: Sequential Fine-Tuning of LLaMA for Coherent Automated Essay Scoring

Quick Answer

Quick Take

Key Points

Paper Resources

Source Excerpt

Want this in your inbox every morning?

More from arXiv cs.CL

TriAgent: Divergence-Aware Committees for Cost-Efficient Financial Sentiment Analysis

RF-Agent: A Practical Framework for Building Language Agents for RFIC Design

Letting the Data Speak: Extracting Keywords from Crowdsourced Collections with AI

Quick Answer

Quick Take

Key Points

Paper Resources

Source Excerpt

Want this in your inbox every morning?

More from arXiv cs.CL

TriAgent: Divergence-Aware Multi-Agent Committees for Cost-Efficient Financial Sentiment Analysis

RF-Agent: A Practical Framework for Building Language Agents for RFIC Design

Letting the Data Speak: Extracting Keywords from Crowdsourced Collections with AI

TriAgent: Divergence-Aware Committees for Cost-Efficient Financial Sentiment Analysis