World Feedback for Clinical Agents: Diagnosing RL in FHIR Environments

arXiv cs.AI·Ananya Mantravadi, Harshit Rajgarhia, Prasanna Desikan, Abhishek Mukherji

7/3/2026

Quick Answer

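The study audits MedAgentBench v1/v2, finds a 41.7% silent-finish ceiling that makes inaction the dominant strategy under RL, and introduces MedAgentBench-v3 (508 tasks), which lowers that ceiling to 8.9%.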

Quick Take

MedAgentBench-v3 (MAB-v3) cuts the silent-finish ceiling, the score an agent can reach by finishing without acting, from 41.7% in MedAgentBench v1/v2 to 8.9%. Training Qwen3-8B on it exposes two barriers to RL learnability: pure RL reaches only 18.2% pass@1 versus 34.1% for rule-based SFT. The paper attributes the whole gap to those barriers and recommends SFT to inject clinical codes and RL to learn conditionals.

Key Points

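MedAgentBench-v3 (MAB-v3) has 508 tasks and an 8.9% silent-finish ceiling, down from 41.7% in MedAgentBench v1/v2.
Pure RL reached 18.2% pass@1, while rule-based SFT reached 34.1%, a 15.9 pp gap.
Training Qwen3-8B exposed a capability ceiling: 10 of 20 task types have 0% base performance, so RL receives zero gradient on them.
It also exposed a format-knowledge barrier: 3 of 20 task types require exact clinical codes that exploration cannot discover.
A decision/format-knowledge/lookup taxonomy predicts RL learnability; the prescribed fix is SFT to inject codes and RL to learn conditionals.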
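
The "zero gradient" point can be illustrated with a toy calculation. The sketch below assumes a group-relative advantage (rewards centered and scaled within a group of rollouts for the same task); the summary does not say which RL algorithm the paper uses, so this choice, the function name `group_relative_advantages`, and the reward lists are all illustrative assumptions.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Center and scale verifier rewards within one group of rollouts.

    Assumption: a group-normalized advantage, chosen only for illustration.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical task type the base model never solves:
# every rollout fails the verifier, so all rewards are identical.
never_solved = [0.0] * 8

# Hypothetical task type the base model sometimes solves.
sometimes_solved = [1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0]

adv_never = group_relative_advantages(never_solved)
adv_some = group_relative_advantages(sometimes_solved)

# With identical rewards every advantage is zero, so no rollout is
# reinforced over another and the policy update carries no signal.
print(adv_never)
# With mixed rewards, successes get positive and failures negative advantages.
print(adv_some)
```

Under this assumption, a task type with 0% base performance contributes nothing to the update no matter how many rollouts are sampled, which is consistent with the capability ceiling described above.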


Article Excerpt

From source RSS / original summary

arXiv:2607.01470v1. Abstract: Clinical protocol-execution tasks (checking a lab value, applying a threshold, placing a correctly structured FHIR order) are natural candidates for RL from world feedback: once clinical SMEs encode decision logic into a verifier, that verifier grades unlimited rollouts without per-episode annotation. But applying RL requires a sound feedback channel and sufficient base capability. We audit MedAgentBench v1/v2, find a 41.7% silent-finish ceiling that makes inaction the RL dominant strategy, and construct MedAgentBench-v3 (MAB-v3) (508 tasks, 8.9% ceiling). Training Qwen3-8B exposes two structural barriers: a capability ceiling (10/20 task types have 0% base performance, zero gradient) and a format-knowledge barrier (3/20 types require exact clinical codes undiscoverable by exploration). Pure RL reaches 18.2% pass@1 vs. 34.1% for rule-based SFT; the 15.9 pp gap is attributable entirely to these barriers. A decision/format-knowledge/lookup taxonomy predicts RL learnability and prescribes the fix: SFT to inject codes, RL to learn conditionals.
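
One way to read the silent-finish ceiling is as the fraction of tasks whose verifier accepts an agent that finishes without taking any action. The sketch below measures that on a toy task set; the `Task` class, the verifier functions, the action name `order_replacement`, and the four example tasks are hypothetical and are not taken from MedAgentBench.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Task:
    name: str
    # Returns True when the list of actions the agent took is acceptable.
    verifier: Callable[[List[str]], bool]


def silent_finish_ceiling(tasks: List[Task]) -> float:
    """Fraction of tasks passed by an agent that takes no actions."""
    passed = sum(1 for task in tasks if task.verifier([]))
    return passed / len(tasks)


def order_required(actions: List[str]) -> bool:
    # Hypothetical rule: the protocol requires placing an order.
    return "order_replacement" in actions


def no_order_expected(actions: List[str]) -> bool:
    # Hypothetical rule: the correct response is to place no order.
    return "order_replacement" not in actions


# Toy task set (assumed): three tasks need an action, one does not.
tasks = [
    Task("low potassium A", order_required),
    Task("low potassium B", order_required),
    Task("low magnesium", order_required),
    Task("normal potassium", no_order_expected),
]

ceiling = silent_finish_ceiling(tasks)
print(f"silent-finish ceiling: {ceiling:.1%}")
```

The higher this fraction, the more reward a policy collects by doing nothing, which is why the abstract treats a 41.7% ceiling as an unsound feedback channel for RL.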

