Hallucination Detection-Guided Preference Optimization for Clinical Summarization
Quick Take
The study introduces two methods that use hallucination detectors to reduce hallucinations in clinical summarization: an inference-time method that iteratively revises summaries under detector guidance, and a preference-learning extension that fine-tunes models on the resulting refinement trajectories. In Llama-3.1-8B-Instruct, the inference-time method reduces hallucinations by 24% and the preference-learning extension by 48%, while preserving summary fluency, coherence, and relevance.
Key Points
- Introduces an inference-time method that uses hallucination detectors to guide iterative summary revisions.
- The inference-time method reduces hallucinations by 24% in Llama-3.1-8B-Instruct.
- A preference-learning extension, fine-tuned on detector-guided refinement trajectories, reduces hallucinations by 48% on the same model.
- Both methods preserve summary fluency, coherence, and relevance according to human expert and LLM-Jury evaluations.
- Demonstrates automated solutions for improving clinical summarization reliability.
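As a rough illustration of the first point, detector-guided revision can be pictured as a loop that alternates between flagging unsupported statements and rewriting the summary. The sketch below is not the paper's implementation: `refine`, `toy_detect`, and `toy_revise` are hypothetical names, and the toy detector and reviser are deliberately simplistic stand-ins for a real hallucination detector and an LLM-based reviser.

```python
from typing import Callable, List, Tuple

# Hypothetical interfaces (assumptions, not taken from the paper):
Detector = Callable[[str, str], List[str]]      # (source, summary) -> flagged spans
Reviser = Callable[[str, str, List[str]], str]  # (source, summary, flagged) -> new summary


def refine(source: str, summary: str, detect: Detector, revise: Reviser,
           max_rounds: int = 3) -> List[Tuple[str, int]]:
    """Revise `summary` until the detector flags nothing or the budget runs out.

    Returns the trajectory as a list of (summary, number_of_flagged_spans).
    """
    trajectory: List[Tuple[str, int]] = []
    for _ in range(max_rounds):
        flagged = detect(source, summary)
        trajectory.append((summary, len(flagged)))
        if not flagged:
            break
        summary = revise(source, summary, flagged)
    else:
        # Budget exhausted: score the last revision so it is recorded too.
        trajectory.append((summary, len(detect(source, summary))))
    return trajectory


def _sentences(text: str) -> List[str]:
    return [s.strip() for s in text.split(".") if s.strip()]


def toy_detect(source: str, summary: str) -> List[str]:
    """Toy detector: flag any summary sentence not found verbatim in the source."""
    return [s for s in _sentences(summary) if s.lower() not in source.lower()]


def toy_revise(source: str, summary: str, flagged: List[str]) -> str:
    """Toy reviser: drop the flagged sentences (a real reviser would rewrite them)."""
    kept = [s for s in _sentences(summary) if s not in flagged]
    return ". ".join(kept) + "." if kept else ""


note = "Patient admitted with chest pain. Troponin was normal."
draft = "Patient admitted with chest pain. Patient had a heart attack."
trajectory = refine(note, draft, toy_detect, toy_revise)
for text, n_flagged in trajectory:
    print(n_flagged, "|", text)
```

Keeping the whole trajectory, rather than only the final summary, is a design choice made here so that intermediate revisions remain available for later analysis or training.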
Article Excerpt
arXiv:2605.28910v1. Abstract: Large language models (LLMs) have shown promise on summarization tasks, but they often produce hallucinations, which are unsupported or incorrect statements that limit their reliability in specialized healthcare applications. We introduce \itermodelfull (\itermodel), an inference-time method that leverages hallucination detectors to guide iterative summary revisions toward factual corrections.
Building on this, we propose \itermodel for Preference Learning (\model), which converts detector-guided refinement trajectories into preference pairs for model finetuning. Extensive experiments show that our methods substantially reduce hallucinations for Llama and Gemma models in summarizing real-world clinical notes from MIMIC-IV. For example, \itermodel reduces 24% and \model reduces 48% of hallucinations in Llama-3.1-8B-Instruct.
Importantly, both methods preserve summary fluency, coherence, and relevance according to human expert and LLM-Jury evaluations. Together, these results demonstrate that detection-informed refinement and preference learning offer an automated solution for improving factual faithfulness in clinical summarization.
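To make the idea of turning refinement trajectories into preference pairs concrete, here is a minimal sketch. The abstract does not say how pairs are selected or which preference objective is used, so everything here is an assumption: the function name `build_preference_pairs`, the `min_gap` threshold, the rule that a later revision is preferred over an earlier one only when the detector flags fewer spans, and the `prompt`/`chosen`/`rejected` record layout.

```python
from typing import Dict, List, Tuple


def build_preference_pairs(prompt: str,
                           trajectory: List[Tuple[str, int]],
                           min_gap: int = 1) -> List[Dict[str, str]]:
    """Pair each earlier summary with every later one that has fewer flags.

    `trajectory` holds (summary, number_of_flagged_spans) in revision order.
    A pair is kept only if the later summary improves by at least `min_gap`.
    """
    pairs: List[Dict[str, str]] = []
    for i, (rejected, rejected_flags) in enumerate(trajectory):
        for chosen, chosen_flags in trajectory[i + 1:]:
            if rejected_flags - chosen_flags >= min_gap:
                pairs.append({"prompt": prompt,
                              "chosen": chosen,
                              "rejected": rejected})
    return pairs


# Hypothetical trajectory: three revisions with decreasing flag counts.
example_trajectory = [
    ("Chest pain. Heart attack. Discharged on aspirin.", 2),
    ("Chest pain. Discharged on aspirin.", 1),
    ("Chest pain.", 0),
]
pairs = build_preference_pairs("Summarize the clinical note.", example_trajectory)
print(len(pairs))  # 3
```

Raising `min_gap` keeps only pairs with a larger detector-measured improvement, which trades the number of training pairs for a clearer preference signal.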