World Feedback for Clinical Agents: Diagnosing RL in FHIR Environments
Quick Answer
The study introduces MedAgentBench-v3 (MAB-v3), a 508-task clinical-agent benchmark that lowers the silent-finish ceiling found in MedAgentBench v1/v2 from 41.7% to 8.9%.
Quick Take
An audit of MedAgentBench v1/v2 finds a 41.7% silent-finish ceiling that makes inaction the dominant RL strategy; MedAgentBench-v3 reduces that ceiling to 8.9%. Training Qwen3-8B on the new benchmark exposes two structural barriers to RL learnability: pure RL reaches only 18.2% pass@1 versus 34.1% for rule-based SFT. The authors' prescription is to use SFT to inject exact clinical codes and RL to learn conditional decision logic.
Key Points
- MedAgentBench-v3 has 508 tasks and an 8.9% silent-finish ceiling, down from 41.7% in MedAgentBench v1/v2.
- Pure RL achieved 18.2% pass@1, while rule-based SFT reached 34.1%, a 15.9 pp gap.
- Two structural barriers account for that gap: a capability ceiling and a format-knowledge barrier.
- Training Qwen3-8B showed that 10 of 20 task types have 0% base performance (zero gradient) and 3 of 20 require exact clinical codes that exploration cannot discover.
- The study advocates SFT to inject clinical codes and RL to learn conditionals.
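The ceiling and zero-gradient ideas in the points above can be illustrated with a toy calculation. This is a minimal sketch, not the paper's code: the task records, the `passes_if_silent` field, and the group-normalized advantage (a GRPO-style estimator, assumed here because the summary does not name the RL algorithm) are all hypothetical.

```python
from statistics import mean, pstdev


def silent_finish_ceiling(tasks):
    """Fraction of tasks the verifier passes when the agent takes no action."""
    return sum(t["passes_if_silent"] for t in tasks) / len(tasks)


def group_advantages(rewards, eps=1e-8):
    """Group-normalized advantages (GRPO-style; an assumption, see lead-in)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical benchmark: 5 of 12 tasks pass even if the agent does nothing.
tasks = [{"passes_if_silent": i < 5} for i in range(12)]
ceiling = silent_finish_ceiling(tasks)

# A task type with 0% base performance: every rollout scores 0,
# so every advantage is 0 and the policy receives no learning signal.
dead = group_advantages([0.0] * 8)

# A task type with some successes: the passing rollout gets a positive advantage.
live = group_advantages([1.0, 0.0, 0.0, 0.0])

# The reported gap between rule-based SFT and pure RL, in percentage points.
gap_pp = round(34.1 - 18.2, 1)
```

Under these assumptions, a high ceiling rewards doing nothing, and a task type with no successful rollouts contributes nothing to the update regardless of how many rollouts are sampled.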
Article Excerpt
From source RSS / original summary
arXiv:2607.01470v1 Announce Type: new
Abstract: Clinical protocol-execution tasks -- checking a lab value, applying a threshold, placing a correctly structured FHIR order -- are natural candidates for RL from world feedback: once clinical SMEs encode decision logic into a verifier, that verifier grades unlimited rollouts without per-episode annotation. But applying RL requires a sound feedback channel and sufficient base capability. We audit MedAgentBench v1/v2, find a 41.7% silent-finish ceiling that makes inaction the RL dominant strategy, and construct MedAgentBench-v3 (MAB-v3) (508 tasks, 8.9% ceiling). Training Qwen3-8B exposes two structural barriers: a capability ceiling (10/20 task types have 0% base performance, zero gradient) and a format-knowledge barrier (3/20 types require exact clinical codes undiscoverable by exploration). Pure RL reaches 18.2% pass@1 vs. 34.1% for rule-based SFT; the 15.9 pp gap is attributable entirely to these barriers. A decision/format-knowledge/lookup taxonomy predicts RL learnability and prescribes the fix: SFT to inject codes, RL to learn conditionals.
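The kind of verifier the abstract describes (check a lab value, apply a threshold, expect a correctly structured FHIR order) might look roughly like the sketch below. Everything specific here is an assumption for illustration: the threshold, the placeholder order code, the function names, and the choice of a FHIR `ServiceRequest` resource are not taken from the paper or from MedAgentBench.

```python
THRESHOLD = 3.5                     # hypothetical lab threshold
REQUIRED_CODE = "PLACEHOLDER-CODE"  # hypothetical; stands in for an exact clinical code


def is_valid_order(order, patient_id):
    """True if the order is a ServiceRequest for this patient with the exact code."""
    codes = [c.get("code") for c in order.get("code", {}).get("coding", [])]
    return (
        order.get("resourceType") == "ServiceRequest"
        and order.get("subject", {}).get("reference") == f"Patient/{patient_id}"
        and REQUIRED_CODE in codes
    )


def verify(lab_value, orders, patient_id):
    """Return 1.0 if the agent's posted orders match the protocol, else 0.0."""
    if lab_value < THRESHOLD:
        # Action required: exactly one order, and it must be correctly structured.
        ok = len(orders) == 1 and is_valid_order(orders[0], patient_id)
        return 1.0 if ok else 0.0
    # No action required: a silent finish is the correct behavior.
    return 1.0 if not orders else 0.0


good = {
    "resourceType": "ServiceRequest",
    "subject": {"reference": "Patient/p1"},
    "code": {"coding": [{"code": REQUIRED_CODE}]},
}
wrong_code = {**good, "code": {"coding": [{"code": "GUESSED-CODE"}]}}

reward_silent_ok = verify(4.0, [], "p1")           # nothing needed, nothing done
reward_silent_bad = verify(3.0, [], "p1")          # order needed, nothing done
reward_correct = verify(3.0, [good], "p1")         # order needed, exact code used
reward_guessed = verify(3.0, [wrong_code], "p1")   # order needed, code guessed
```

In this toy setup the conditional (order only below the threshold) yields reward signal that exploration can find, whereas the exact code string does not: a guessed code scores the same as no order at all, which is one way to read the abstract's split between what RL can learn and what SFT must inject.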