Automated Scoring of Arabic Text Using Large Language Models: A Literature Review
Quick Answer
This literature review surveys LLM-based approaches to Automated Text Scoring (ATS) of Arabic texts, covering short answer grading and essay scoring.
Quick Take
This literature review explores LLM-based approaches for Automated Text Scoring (ATS) of Arabic texts, focusing on short answer grading and essay scoring. It introduces a five-dimensional taxonomy for comparative analysis of methodologies, datasets, and performance metrics, emphasizing the need for ongoing research to enhance educational quality in Arabic-speaking communities.
Key Points
- Automated Text Scoring (ATS) enables scalable evaluation of learner responses without human input.
- The study introduces a taxonomy with five dimensions for analyzing ATS methodologies.
- Focus areas include short answer grading (ASAG) and essay scoring (AES) for Arabic texts.
- Findings stress the importance of pedagogically grounded research in Arabic ATS.
- Increased accessibility of LLMs has revitalized interest in Arabic text evaluation.
Article Excerpt
From source RSS / original summary
arXiv:2606.09830v1 Announce Type: new
Abstract: In modern educational systems, Automatic Text Scoring (ATS) plays a central role by enabling scalable and consistent evaluation of learner responses without human intervention. Recently, the increased accessibility of LLMs and Arabic-specific datasets has sparked renewed interest in this area. In this work, we investigate LLM-based approaches for the automated evaluation of Arabic texts, focusing on both short answer grading (ASAG) and essay scoring (AES).
We further introduce a structured taxonomy comprising five dimensions: application domain, feedback generation capability, LLM architecture deployed, alignment with competency referential frameworks, and prompt engineering strategy. By applying this taxonomy, we conduct a comparative analysis of existing studies, examining their methodological approaches, datasets, evaluation metrics, and reported performance.
The findings highlight the need for sustained and pedagogically grounded research efforts in Arabic ATS, given its significance for improving educational quality across Arabic-speaking communities.
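The five taxonomy dimensions named in the abstract can be pictured as one record per reviewed study. The Python sketch below is only an illustration, not the authors' tooling: the class names, field names, and the two sample entries are hypothetical placeholders, and the paper's actual category values are not given in this excerpt.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, List, Optional


class Domain(Enum):
    """Application domain, the first taxonomy dimension."""
    ASAG = "short answer grading"
    AES = "essay scoring"


@dataclass(frozen=True)
class StudyProfile:
    """One reviewed study described along the five taxonomy dimensions."""
    application_domain: Domain
    generates_feedback: bool              # feedback generation capability
    llm_architecture: str                 # LLM architecture deployed
    competency_framework: Optional[str]   # competency referential framework, if any
    prompt_strategy: str                  # prompt engineering strategy


def group_by_domain(profiles: List[StudyProfile]) -> Dict[Domain, List[StudyProfile]]:
    """Bucket studies by application domain for side-by-side comparison."""
    groups: Dict[Domain, List[StudyProfile]] = {d: [] for d in Domain}
    for profile in profiles:
        groups[profile.application_domain].append(profile)
    return groups


# Hypothetical placeholder entries, not studies from the review.
examples = [
    StudyProfile(Domain.ASAG, False, "placeholder-architecture-A", None, "zero-shot"),
    StudyProfile(Domain.AES, True, "placeholder-architecture-B", "placeholder-framework", "few-shot"),
]
grouped = group_by_domain(examples)
```

Grouping on a single dimension mirrors the kind of comparative analysis the abstract describes; the same pattern would apply to any of the other four fields.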