FaithMed: Training LLMs For Faithful Evidence-Based Medical Reasoning
Quick Take
FaithMed combines clinician-designed, automatically refined rubrics with reinforcement learning to supervise medical reasoning at the step level. Across seven medical benchmarks, it improves on agentic-search baselines by 9% on average and raises average evidence-based medicine rubric scores by 15.5% over agentic-search Qwen3 baselines. The framework targets transparent, evidence-grounded clinical reasoning.
Key Points
- FaithMed combines clinician-designed rubrics with reinforcement learning for improved medical reasoning.
- Achieved a 9% average improvement over agentic-search baselines across seven medical benchmarks.
- Raised average evidence-based medicine rubric scores by 15.5% over agentic-search Qwen3 baselines.
- Improved on outcome-only RL by 5.8%, suggesting that explicit step-level supervision can improve both task success and reasoning faithfulness.
- Code for FaithMed is available on GitHub.
Article Excerpt
From source RSS / original summary (arXiv:2607.01440v1)
Abstract: Faithful reasoning is essential in medicine, where clinical decisions require transparent justification grounded in reliable evidence. Current medical LLMs either lack active access to evidence or use retrieved evidence without supervising how it should be appraised and applied during reasoning.
To address this, we formalize evidence-based medicine principles as process-level criteria and introduce FaithMed, a framework that combines clinician-designed, automatically refined rubrics with reinforcement learning using step-level process reward assignment and advantage grouping. Across seven medical benchmarks, FaithMed improves over agentic-search baselines (+9% on average) and outcome-only RL (+5.8%), while raising average evidence-based medicine rubric scores over agentic-search Qwen3 baselines (+15.5%).
This work demonstrates that explicit step-level supervision can improve both task success and the faithfulness of the reasoning process. Code is available at https://github.com/cxcscmu/FaithMed.
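To make "step-level process reward assignment and advantage grouping" more concrete, here is a minimal Python sketch of one way such a scheme could work. It is not taken from the FaithMed code. The abstract does not specify the step types, the blending weight, the choice to group by step type, or the mean/std normalization, so all of these are illustrative assumptions, and the function names are hypothetical.

```python
from collections import defaultdict
from statistics import mean, pstdev


def blend_step_rewards(rollout, outcome_weight=0.5):
    """Give every step its own reward (hypothetical scheme).

    Each step's rubric score is mixed with the rollout's final outcome
    reward; outcome_weight is an assumed hyperparameter.
    """
    return [
        (step_type, (1 - outcome_weight) * rubric_score + outcome_weight * rollout["outcome"])
        for step_type, rubric_score in rollout["steps"]
    ]


def grouped_advantages(rollouts, outcome_weight=0.5, eps=1e-8):
    """Normalize step rewards within groups (assumed: one group per step type).

    Within a group, advantage = (reward - group mean) / (group std + eps),
    so a step is compared only with steps of the same kind.
    """
    groups = defaultdict(list)
    for rollout in rollouts:
        for step_type, reward in blend_step_rewards(rollout, outcome_weight):
            groups[step_type].append(reward)

    advantages = {}
    for step_type, rewards in groups.items():
        mu, sigma = mean(rewards), pstdev(rewards)
        advantages[step_type] = [(r - mu) / (sigma + eps) for r in rewards]
    return advantages


# Two hypothetical rollouts for the same question: rubric scores in [0, 1]
# per step, plus a 0/1 outcome reward for the final answer.
rollouts = [
    {"steps": [("retrieve", 1.0), ("appraise", 0.5), ("apply", 1.0)], "outcome": 1.0},
    {"steps": [("retrieve", 0.0), ("appraise", 0.5), ("apply", 0.0)], "outcome": 0.0},
]
advantages = grouped_advantages(rollouts)
```

In this sketch, a step that scores above its group's average gets a positive advantage even if another step in the same rollout scores poorly. That per-step credit is the intuition behind step-level supervision, as opposed to outcome-only RL.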