FaithMed: Training LLMs For Faithful Evidence-Based Medical Reasoning

arXiv cs.CL·Zhiyun Zhang, Liwen Sun, Xiang Qian, Chenyan Xiong

7/3/2026

Quick Answer

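FaithMed trains LLMs for medical reasoning by combining clinician-designed, automatically refined rubrics with reinforcement learning that assigns rewards at the step level. Across seven medical benchmarks, it improves on agentic-search baselines by 9% on average and raises average evidence-based medicine rubric scores by 15.5% over agentic-search Qwen3 baselines.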

Quick Take

Key Points

FaithMed combines clinician-designed rubrics with reinforcement learning for improved medical reasoning.
Improves on agentic-search baselines by 9% on average across seven medical benchmarks, and on outcome-only RL by 5.8%.
Raises average evidence-based medicine rubric scores by 15.5% over agentic-search Qwen3 baselines.
Explicit step-level supervision can improve both task success and the faithfulness of the reasoning process.
Code for FaithMed is available on GitHub at https://github.com/cxcscmu/FaithMed.


Article Excerpt

From source RSS / original summary

arXiv:2607.01440v1. Abstract: Faithful reasoning is essential in medicine, where clinical decisions require transparent justification grounded in reliable evidence. Current medical LLMs either lack active access to evidence or use retrieved evidence without supervising how it should be appraised and applied during reasoning.

To address this, we formalize evidence-based medicine principles as process-level criteria and introduce FaithMed, a framework that combines clinician-designed, automatically refined rubrics with reinforcement learning using step-level process reward assignment and advantage grouping. Across seven medical benchmarks, FaithMed improves over agentic-search baselines (+9% on average) and outcome-only RL (+5.8%), while raising average evidence-based medicine rubric scores over agentic-search Qwen3 baselines (+15.5%).

This work demonstrates that explicit step-level supervision can improve both task success and the faithfulness of the reasoning process. Code is available at https://github.com/cxcscmu/FaithMed.
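
The abstract names step-level process reward assignment and advantage grouping but does not spell out the mechanics. The sketch below is one possible reading, not the authors' implementation: each reasoning step is scored as the fraction of rubric criteria it meets, and advantages are normalized within groups of steps that share the same position across rollouts of one prompt. The function names, the boolean rubric checks, and the choice to group by step index are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def rubric_step_reward(step_checks):
    """Hypothetical step reward: fraction of rubric criteria the step satisfies."""
    return sum(step_checks) / len(step_checks) if step_checks else 0.0


def grouped_step_advantages(rollouts, eps=1e-6):
    """Normalize step rewards within groups of same-index steps.

    rollouts: list of rollouts for ONE prompt; each rollout is a list of
    steps; each step is a list of booleans (one per rubric criterion).
    Grouping by step index is an assumption, not taken from the paper.
    """
    rewards = [[rubric_step_reward(step) for step in rollout] for rollout in rollouts]
    advantages = [[0.0] * len(r) for r in rewards]
    max_len = max((len(r) for r in rewards), default=0)

    for t in range(max_len):
        # Collect the reward of step t from every rollout that has a step t.
        group = [(i, r[t]) for i, r in enumerate(rewards) if t < len(r)]
        values = [v for _, v in group]
        mu, sigma = mean(values), pstdev(values)
        for i, v in group:
            # Standardize within the group; eps avoids division by zero.
            advantages[i][t] = (v - mu) / (sigma + eps)
    return advantages


if __name__ == "__main__":
    # Two rollouts, two steps each, two rubric criteria per step.
    example = [
        [[True, True], [True, False]],
        [[False, False], [True, False]],
    ]
    for rollout_advantages in grouped_step_advantages(example):
        print(rollout_advantages)
```

In this reading, a step is rewarded relative to how other sampled rollouts handled the same stage of reasoning, so a step that meets more rubric criteria than its peers gets a positive advantage even if the final answers match. An outcome-only setup would instead give every step in a rollout the same signal.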


