A Multi-Agent Framework for Feature-Constrained Difficulty Control in Reading Comprehension Item Generation
Quick Take
MAFIG (Multi-agent Framework for Feature-constrained Item Generation) uses multiple LLM agents and feature-specific evaluators to control the difficulty of generated reading comprehension items.
Key Points
- Uses multiple LLM agents to generate and iteratively revise items.
- Uses feature-specific evaluators to check that items adhere to the intended constraints.
- Generates items that adhere to target constraints at a significantly higher rate than baselines.
Abstract: Recent studies in difficulty-controlled reading comprehension item generation have leveraged large language models (LLMs) to produce items by adjusting difficulty-related features. However, existing methods typically rely on a single-agent prompting approach, which often fails to consistently satisfy specified feature constraints, resulting in items that deviate from the target difficulty level. To address this limitation, we introduce MAFIG, a Multi-agent Framework for Feature-constrained Item Generation, where multiple LLM agents and feature-specific evaluators collaborate to generate and iteratively revise items based on intended constraints. Furthermore, to verify the efficacy of MAFIG in difficulty control, we propose a method for constructing a sequence of feature constraint sets that yield items with monotonically increasing difficulty. Experimental results demonstrate that MAFIG generates items that adhere to target constraints at a significantly higher rate than baselines, achieving robust difficulty control through the difficulty-calibrated constraint sequence.
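The generate-evaluate-revise loop described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify agent roles, prompts, the feature set, or stopping criteria, so every name below (`generate_with_revision`, `toy_generate`, `toy_revise`, `max_question_words`, `max_rounds`) is hypothetical, and plain Python functions stand in for the LLM agents and the feature-specific evaluators.

```python
from typing import Callable, Dict, Optional, Tuple

# Hypothetical item representation; the abstract does not define the item format.
Item = Dict[str, str]
# A feature-specific evaluator returns None when its constraint is met,
# otherwise a feedback message describing the violation.
Evaluator = Callable[[Item], Optional[str]]


def generate_with_revision(
    generate: Callable[[], Item],
    revise: Callable[[Item, Dict[str, str]], Item],
    evaluators: Dict[str, Evaluator],
    max_rounds: int = 3,  # assumed revision budget, not taken from the paper
) -> Tuple[Item, bool]:
    """Generate an item, then revise it until every evaluator passes or the budget runs out."""
    item = generate()
    for _ in range(max_rounds):
        # Collect feedback from each evaluator whose constraint is violated.
        feedback: Dict[str, str] = {}
        for name, check in evaluators.items():
            message = check(item)
            if message is not None:
                feedback[name] = message
        if not feedback:
            return item, True
        item = revise(item, feedback)
    # Re-check the last revision so the returned flag reflects the returned item.
    satisfied = all(check(item) is None for check in evaluators.values())
    return item, satisfied


# Toy stand-ins for the LLM agents (assumptions for illustration only).
def toy_generate() -> Item:
    return {
        "question": "What is the main idea of the passage, and why does "
                    "the author mention the river twice?",
        "answer": "river",
    }


def max_question_words(limit: int) -> Evaluator:
    def check(item: Item) -> Optional[str]:
        count = len(item["question"].split())
        if count <= limit:
            return None
        return f"question has {count} words; limit is {limit}"
    return check


def toy_revise(item: Item, feedback: Dict[str, str]) -> Item:
    revised = dict(item)
    if "question_length" in feedback:
        # A real reviser would rewrite the question; truncation keeps the toy deterministic.
        revised["question"] = " ".join(revised["question"].split()[:8])
    return revised


item, ok = generate_with_revision(
    toy_generate,
    toy_revise,
    {"question_length": max_question_words(8)},
)
```

In this sketch each evaluator owns exactly one feature constraint, so a reviser receives targeted feedback instead of a single pass/fail signal. Whether MAFIG routes feedback this way is an assumption; the abstract only states that agents and evaluators collaborate to revise items iteratively.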
| Comments: | ACL 2026 Main Conference |
| Subjects: | Computation and Language (cs.CL) |
| Cite as: | arXiv:2605.19316 [cs.CL] (or arXiv:2605.19316v1 [cs.CL] for this version) |
| DOI: | https://doi.org/10.48550/arXiv.2605.19316 (arXiv-issued DOI via DataCite, pending registration) |
Submission history
From: Seonjeong Hwang
[v1] Tue, 19 May 2026 03:52:00 UTC (728 KB)
Originally published at arxiv.org