Response-free item difficulty modelling for multiple-choice items with fine-tuned transformers: Component-wise representation and multi-task learning
Quick Take
The study introduces a response-free item difficulty model using fine-tuned transformers for multiple-choice questions.
Key Points
- Eliminates manual feature engineering in item difficulty modeling.
- Introduces component-wise and multi-task learning approaches.
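- Shows significant paired improvements for the multi-task variant at the smallest training-set size, while the component-wise variant shows no detectable benefit.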
Abstract: Response-free item difficulty modelling promises to reduce reliance on response-based calibration but is intrinsically difficult on reading-comprehension multiple-choice items, where difficulty depends on inferential demands across wording components. Whereas most existing approaches extract item-text features and pass them to a separate statistical or machine-learning model, we fine-tune transformer encoders end-to-end on the item wording, eliminating the manual feature engineering and preprocessing that discards information. Moreover, two extensions to this joint-encoding approach are proposed: a component-wise variant that encodes wording components separately through a shared encoder, and a multi-task variant that retains joint encoding and adds an auxiliary multiple-choice question answering objective on the shared encoder. Each method is evaluated under a Monte Carlo subsampling design at three training-set sizes on a held-out test set. We find that joint encoding is a viable end-to-end alternative to feature-engineering pipelines; while the component-wise variant shows no detectable benefit, consistent with self-attention already harvesting the cross-component signal, the multi-task variant delivers significant paired improvements in the smallest-sample regime. Transformer fine-tuning, especially if regularised by a suitable auxiliary task, recovers a substantial share of the wording-derivable signal at training-set sizes typical of applied measurement. The framework provides a customisable interface for psychometrically motivated extensions.
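The Monte Carlo subsampling design mentioned in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' code: the training-set sizes (50, 100, 200), the number of replications (30), the synthetic difficulty values, and the two stand-in "methods" (a mean predictor and a median predictor in place of fine-tuned transformers) are all hypothetical. The only point it shows is the pairing: both methods are fitted on the same random subset in each replication and scored on the same held-out test set, so their errors can be compared replication by replication.

```python
import random
import statistics


def draw_subsets(pool_size, train_size, n_reps, seed):
    """Draw n_reps random training subsets (without replacement) from the pool."""
    rng = random.Random(seed)
    return [rng.sample(range(pool_size), train_size) for _ in range(n_reps)]


def rmse(prediction, test_y):
    """Root-mean-square error of a constant prediction on the test targets."""
    return (sum((prediction - y) ** 2 for y in test_y) / len(test_y)) ** 0.5


def paired_differences(pool_y, test_y, train_size, n_reps, seed=0):
    """Per-replication error difference (method A minus method B).

    Both stand-in methods see the same training subset and the same
    held-out test set, which is what makes the comparison paired.
    """
    diffs = []
    for idx in draw_subsets(len(pool_y), train_size, n_reps, seed):
        train_y = [pool_y[i] for i in idx]
        err_a = rmse(statistics.fmean(train_y), test_y)   # stand-in method A
        err_b = rmse(statistics.median(train_y), test_y)  # stand-in method B
        diffs.append(err_a - err_b)
    return diffs


# Synthetic item difficulties (assumption: real data would be calibrated values).
data_rng = random.Random(42)
pool_y = [data_rng.gauss(0.0, 1.0) for _ in range(500)]
test_y = [data_rng.gauss(0.0, 1.0) for _ in range(100)]

results = {}
for train_size in (50, 100, 200):  # hypothetical sizes, not taken from the paper
    diffs = paired_differences(pool_y, test_y, train_size, n_reps=30, seed=train_size)
    results[train_size] = (statistics.fmean(diffs), statistics.stdev(diffs))
    print(train_size, results[train_size])
```

In a real study the two stand-in predictors would be replaced by the fitted models being compared, and the per-replication differences would feed a paired significance test of the analyst's choice.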
| Subjects: | Computation and Language (cs.CL); Artificial Intelligence (cs.AI) |
| Cite as: | arXiv:2605.16991 [cs.CL] |
| | (or arXiv:2605.16991v1 [cs.CL] for this version) |
| | https://doi.org/10.48550/arXiv.2605.16991 (arXiv-issued DOI via DataCite, pending registration) |
Submission history
From: Jan Netík
[v1] Sat, 16 May 2026 13:22:57 UTC (282 KB)
— Originally published at arxiv.org