When Reasoning Supervision Hurts: TTCW-Based Long-Form Literary Review Generation

arXiv cs.CL·Jinlong Liu, Mohammed Bahja, Mark Lee

Quick Take

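For fixed-format, TTCW-based literary review generation, fine-tuning with reasoning content can hurt: non-reasoning fine-tuning is stronger and more stable, while reasoning-supervised models are more prone to parse failures.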

Key Points

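The authors construct a dataset of 263,911 long-form stories, each annotated with scalar scores and meta-synthesised review comments across 14 TTCW-based dimensions.
Qwen3 models (4B and 8B) fine-tuned without reasoning content perform better and more stably than those fine-tuned with it; the best setting reaches an evaluation score of 0.6820.
Reasoning-supervised models often continue with irrelevant or repetitive reasoning-style text instead of completing the required 14-metric review report.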

[Submitted on 19 May 2026]

Abstract: Automatic evaluation of long-form literary writing remains challenging, as generic LLM-as-Judge approaches may not fully capture creativity-related dimensions such as originality and flexibility. Although the Torrance Test of Creative Writing (TTCW) provides a structured creativity framework, and prior work has demonstrated reference-based TTCW evaluation at the pairwise level, no large-scale dataset exists for long-form TTCW-based literary review generation. We address this gap by constructing a dataset of 263,911 long-form stories, each annotated with scalar scores and meta-synthesised review comments across 14 TTCW-based dimensions. Using this dataset, we fine-tune Qwen3 models at two scales, 4B and 8B, under two conditions: with and without reasoning content. Results show that non-reasoning fine-tuning achieves stronger and more stable performance, with the best setting reaching an evaluation score of 0.6820. Further analysis shows that reasoning-supervised models are more prone to parse failures, often continuing with irrelevant or repetitive reasoning-style text rather than completing the required 14-metric review report. These results suggest that, for fixed-format rubric-based review generation, reasoning supervision is not straightforwardly beneficial, and precise metric-aligned scoring remains challenging even after task-specific fine-tuning.
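To make the notion of a "parse failure" concrete, here is a minimal Python sketch of how a fixed-format 14-metric report might be checked. The abstract does not specify the report format or the metric names, so everything here is an assumption for illustration: the one-line-per-metric layout, the placeholder names metric_1 to metric_14, and the helper functions parse_report and parse_failure_rate are all hypothetical and not the authors' implementation.

```python
import re
from typing import Dict, List, Optional

# Assumption: each metric is reported on its own line as "<name>: <score>".
# The names below are placeholders, not the paper's actual TTCW dimensions.
EXPECTED_METRICS = [f"metric_{i}" for i in range(1, 15)]
LINE_RE = re.compile(r"^\s*(metric_\d+)\s*:\s*(\d+(?:\.\d+)?)\s*$")


def parse_report(text: str) -> Optional[Dict[str, float]]:
    """Return {metric: score} if all 14 metrics appear exactly once, else None."""
    scores: Dict[str, float] = {}
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if match is None:
            continue  # ignore free-form text such as review comments
        name, value = match.group(1), float(match.group(2))
        if name in scores:
            return None  # a repeated metric counts as a failure
        scores[name] = value
    if set(scores) != set(EXPECTED_METRICS):
        return None  # missing or unexpected metrics count as a failure
    return scores


def parse_failure_rate(outputs: List[str]) -> float:
    """Fraction of model outputs that do not yield a complete report."""
    if not outputs:
        return 0.0
    failures = sum(parse_report(output) is None for output in outputs)
    return failures / len(outputs)
```

Under these assumptions, an output that drifts into repetitive reasoning-style text and never lists all 14 metrics would be counted as a failure, which is the kind of behaviour the abstract attributes to reasoning-supervised models.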

Comments:	Submitted to EMNLP 2026
Subjects:	Computation and Language (cs.CL)
Cite as:	arXiv:2605.20364 [cs.CL]
	(or arXiv:2605.20364v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2605.20364 arXiv-issued DOI via DataCite (pending registration)

Submission history

From: Jinlong Liu
[v1] Tue, 19 May 2026 18:16:58 UTC (222 KB)

— Originally published at arxiv.org
