Response-free item difficulty modelling for multiple-choice items with fine-tuned transformers: Component-wise representation and multi-task learning

arXiv cs.CL·Jan Net\'ik, Patr\'icia Martinkov\'a

1d ago

·~2 min·5/19/2026·en·1

Quick Take

The study introduces a response-free item difficulty model using fine-tuned transformers for multiple-choice questions.

Key Points

Eliminates manual feature engineering in item difficulty modeling.
Introduces component-wise and multi-task learning approaches.
Demonstrates significant improvements in small sample sizes.

📖 Reader Mode

~2 min read

[Submitted on 16 May 2026]

View PDF HTML (experimental)

Abstract:Response-free item difficulty modelling promises to reduce reliance on response-based calibration but is intrinsically difficult on reading-comprehension multiple-choice items, where difficulty depends on inferential demands across wording components. Whereas most existing approaches extract item-text features and pass them to a separate statistical or machine-learning model, we fine-tune transformer encoders end-to-end on the item wording, eliminating the manual feature engineering and preprocessing that discards information. Moreover, two extensions to this joint-encoding approach are proposed: a component-wise variant that encodes wording components separately through a shared encoder, and a multi-task variant that retains joint encoding and adds an auxiliary multiple-choice question answering objective on the shared encoder. Each method is evaluated under a Monte Carlo subsampling design at three training-set sizes on a held-out test set. We find that joint encoding is a viable end-to-end alternative to feature-engineering pipelines; while the component-wise variant shows no detectable benefit, consistent with self-attention already harvesting the cross-component signal, the multi-task variant delivers significant paired improvements in the smallest-sample regime. Transformer fine-tuning, especially if regularised by a suitable auxiliary task, recovers a substantial share of the wording-derivable signal at training-set sizes typical of applied measurement. The framework provides a customisable interface for psychometrically motivated extensions.

Subjects:	Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Cite as:	arXiv:2605.16991 [cs.CL]
	(or arXiv:2605.16991v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2605.16991 arXiv-issued DOI via DataCite (pending registration)

Submission history

From: Jan Netík [view email]
[v1] Sat, 16 May 2026 13:22:57 UTC (282 KB)

— Originally published at arxiv.org

Continue reading on arxiv.org

Response-free item difficulty modelling for multiple-choice items with fine-tuned transformers: Component-wise representation and multi-task learning

Quick Take

Key Points

📖 Reader Mode

Submission history

More from arXiv cs.CL

Time to REFLECT: Can We Trust LLM Judges for Evidence-based Research Agents?

Diagnosing Multi-step Reasoning Failures in Black-box LLMs via Stepwise Confidence Attribution

MMoA: An AI-Agent framework with recurrence for Memoried Mixure-of-Agent

Related in this space

From Prompts to Protocols: An AI Agent for Laboratory Automation

Agentic Trading: When LLM Agents Meet Financial Markets