Small LLMs for Biomedical Claim Verification: Cost-Effective Fine-Tuning, Structural Dataset Shortcuts, and Cross-Domain Generalization
Quick Answer
This paper shows that fine-tuning small LLMs such as Mistral-7B with QLoRA on limited data can outperform much larger models such as GPT-4o and GPT-5 on biomedical claim verification, with up to a 12% F1 gain at a fraction of the cost.
Quick Take
With only 1,008 training examples, a QLoRA-tuned Mistral-7B surpasses both GPT-4o and GPT-5 on biomedical claim verification. The study also finds that dataset structure, not just data quantity, matters for robust cross-domain generalization.
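Part of why QLoRA is cheap is that the base model stays frozen in 4-bit form and only small low-rank adapter matrices are trained. The sketch below works through that arithmetic for a single weight matrix. The layer shape and rank are illustrative assumptions, not values reported in the paper, and the helper names are hypothetical.

```python
def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters in one LoRA adapter: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)


def four_bit_weight_bytes(n_params: int) -> int:
    """Rough storage for 4-bit weights, ignoring quantization constants."""
    return n_params // 2


# Assumed layer shape and rank, chosen only for illustration.
d_in, d_out, rank = 4096, 4096, 16

full = d_in * d_out                                  # frozen base weights
adapter = lora_trainable_params(d_in, d_out, rank)   # trained weights
fraction = adapter / full

print(f"full={full:,} adapter={adapter:,} trainable fraction={fraction:.4%}")
```

Under these assumptions the adapter holds under 1% of the parameters of the matrix it modifies, which is where most of the training-cost saving comes from.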
Key Points
- Mistral-7B fine-tuned with QLoRA surpasses both GPT-4o and GPT-5 by up to 12% F1 using only 1,008 training examples.
- The study fine-tunes Phi-3-mini (3.8B), Qwen2.5-3B, and Mistral-7B on SciFact and HealthVer, and compares them against GPT-4o, GPT-5, and fine-tuned BioLinkBERT encoders.
- The authors identify a previously unreported structural artifact in SciFact that inflates in-domain scores.
- Bidirectional evaluation at matched training sizes shows that training on structurally sound data enables robust cross-domain transfer.
- The authors plan to release all code and adapter checkpoints.
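The F1 figures above compare predicted and gold verdicts per claim. As a minimal sketch of how such a score can be computed, the function below calculates macro-averaged F1. The summary does not say which F1 variant the paper reports, so macro averaging is an assumption here, and the label names are placeholders rather than the datasets' own label strings.

```python
from collections import Counter


def macro_f1(gold, pred):
    """Unweighted mean of per-label F1 over every label seen in gold or pred."""
    labels = sorted(set(gold) | set(pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted label p, but it was wrong
            fn[g] += 1  # gold label g was missed
    scores = []
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy verdicts with placeholder label names (not real dataset rows).
gold = ["SUPPORT", "REFUTE", "NEI", "SUPPORT", "REFUTE", "NEI"]
pred = ["SUPPORT", "REFUTE", "SUPPORT", "SUPPORT", "NEI", "NEI"]
score = macro_f1(gold, pred)
print(f"macro-F1 = {score:.3f}")
```

Macro averaging weights every label equally, so a model cannot score well simply by favoring the most frequent verdict.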
Article Excerpt
From source RSS / original summary (arXiv:2606.12854v1)
Abstract: Large Language Models such as GPT-4o and GPT-5 achieve strong zero-shot performance on biomedical claim verification, but cost and opacity limit scalable use. We fine-tune three small LLMs: Phi-3-mini (3.8B), Qwen2.5-3B, and Mistral-7B, via QLoRA on SciFact and HealthVer, providing the first study of QLoRA models against GPT-4o and fine-tuned BioLinkBERT encoders.
Mistral-7B QLoRA surpasses both GPT-4o and GPT-5 (up to 12% F1 gain) at a fractional cost using just 1,008 training examples. We conduct extensive in-domain and cross-domain evaluation: models trained on SciFact tested on HealthVer and vice versa, at matched sizes to isolate dataset structure from data quantity.
We identify a previously unreported structural artifact in SciFact that inflates in-domain scores, and show through bidirectional out-of-domain evaluation that training on structurally sound data enables robust cross-domain transfer. We plan to release all code and adapter checkpoints.
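The bidirectional, matched-size protocol described in the abstract can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: how the paper matches training-set sizes is not given in the excerpt, so subsampling every set down to the smallest one is an assumption, the helper names are hypothetical, and the toy integer lists stand in for real training examples.

```python
import random
from itertools import product


def matched_subsample(datasets, seed=0):
    """Cut every training set down to the size of the smallest one."""
    n = min(len(rows) for rows in datasets.values())
    rng = random.Random(seed)  # fixed seed so the subsample is reproducible
    return {name: rng.sample(rows, n) for name, rows in datasets.items()}


def evaluation_grid(names):
    """Every (train, test) pairing, tagged as in-domain or cross-domain."""
    return [
        (train, test, "in-domain" if train == test else "cross-domain")
        for train, test in product(names, repeat=2)
    ]


# Toy stand-ins for the two training sets (sizes are arbitrary).
toy = {"SciFact": list(range(10)), "HealthVer": list(range(25))}

matched = matched_subsample(toy)
grid = evaluation_grid(sorted(toy))

for train, test, kind in grid:
    print(f"train on {train:<9} ({len(matched[train])} examples) "
          f"-> test on {test:<9} [{kind}]")
```

Holding the training size equal means that any gap between the two cross-domain directions reflects what each dataset teaches rather than how much data it contains.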