Long Live Fine-Tuning: Task-Specific Transformers Outperform Zero-Shot LLMs for Misinformation Response Classification on Reddit
Quick Take
Fine-tuned RoBERTa outperforms zero-shot models, including Claude Haiku 4.5, at classifying Reddit responses to misinformation, reaching a macro-F1 of 0.62 versus 0.50 for the best zero-shot result. The advantage is concentrated on the belief class (comments that propagate the claim), which every zero-shot model under-detects. Despite the proliferation of large generative models, task-specific fine-tuning remains the more reliable choice for this kind of nuanced classification.
Key Points
- Fine-tuned RoBERTa achieves 0.62 macro-F1, outperforming Claude Haiku 4.5's 0.50.
- Llama-3-8B matches Llama-3-70B, so scaling alone does not guarantee better results.
- Every zero-shot model under-detects the belief class, which is the costlier error to make in a verification context.
- Task-specific fine-tuning is more cost-effective and reliable for nuanced classification tasks.
- Label schema and topic significantly influence zero-shot model performance.
Article Content
arXiv:2606.04274v1. Abstract: As large language models (LLMs) become default tools for online information verification, an implicit assumption follows them: that scale and general capability are sufficient for nuanced classification of misinformation discourse. We test this assumption directly on 900 Reddit comments spanning three PolitiFact-verified misinformation claims (environment, health, immigration), labelled as belief (propagates the claim), fact-check (corrects it), or other.
We compare nine models across three paradigms -- BART-MNLI, three Llama variants, three commercial frontier LLMs (Claude Haiku 4.5, Gemini Flash Lite 2.5, Claude Sonnet 4.6), and fine-tuned DistilBERT and RoBERTa -- under universal and topic-specific label schemas. The assumption does not hold. Fine-tuned RoBERTa reaches 0.62 macro-$F_1$ against a best zero-shot result of 0.50 (Claude Haiku 4.5), at a fraction of the per-query cost; the supervised advantage is concentrated on the belief class, the implicit, affective category every zero-shot model under-detects.
Scaling does not help: Llama-3-8B matches Llama-3-70B, and Claude Sonnet 4.6 underperforms the smaller Haiku under generic labels, collapsing belief detection to 0.17 and refusing outright on a subset of comments flagged as sensitive. This is a safety-alignment artefact, not a capacity limit.
Label schema and topic jointly shape zero-shot performance, with the same model varying by more than 0.13 macro-$F_1$ across topics under matched labels. In a verification context, where missing belief is the costlier error, task-specific fine-tuning remains the more reliable choice despite the proliferation of large generative models.
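Since the comparison rests on macro-$F_1$ over the three labels, a small sketch may help show why under-detecting one class costs so much on this metric. The toy labels and predictions below are invented for illustration and are not data from the paper; the function names are hypothetical helpers, not the authors' evaluation code.

```python
# Macro-F1 sketch: per-class F1 averaged with equal weight per class.
# All data here is made up for illustration (assumption, not from the paper).

LABELS = ("belief", "fact-check", "other")


def per_class_f1(gold, pred, label):
    """F1 for a single label, treating it as the positive class."""
    pairs = list(zip(gold, pred))
    tp = sum(1 for g, p in pairs if g == label and p == label)
    fp = sum(1 for g, p in pairs if g != label and p == label)
    fn = sum(1 for g, p in pairs if g == label and p != label)
    if tp == 0:
        # No true positives: precision or recall is zero, so F1 is zero.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(gold, pred, labels=LABELS):
    """Unweighted mean of per-class F1, so rare classes count as much as common ones."""
    return sum(per_class_f1(gold, pred, label) for label in labels) / len(labels)


def accuracy(gold, pred):
    """Fraction of items where the prediction equals the gold label."""
    return sum(1 for g, p in zip(gold, pred) if g == p) / len(gold)


# Hypothetical classifier that never predicts "belief" and calls it "other".
gold = ["belief"] * 2 + ["fact-check"] * 4 + ["other"] * 4
pred = ["other"] * 2 + ["fact-check"] * 4 + ["other"] * 4

print(f"accuracy: {accuracy(gold, pred):.2f}")
print(f"macro-F1: {macro_f1(gold, pred):.2f}")
for label in LABELS:
    print(f"  {label}: {per_class_f1(gold, pred, label):.2f}")
```

In this toy case the classifier gets most items right, yet the zero score on belief pulls the macro average well below accuracy. That is the sense in which a model that misses the belief class is penalised heavily under macro-$F_1$, regardless of how well it handles the other two labels.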