An AI agent for treatment reasoning over a biomedical tool universe

arXiv cs.AI·Shanghua Gao, Ayush Noori, Richard Zhu, Curtis Ginder, Zhenglun Kong, Xiaorui Su, Justin Kauffman, Benjamin S. Glicksberg, Joshua Lampert, Ankit Sakhuja, Ashwin Sawant, ATHENA-R1 Evaluation Consortium, David A. Clifton, Noa Dagan, Ran Balicer, Marinka Zitnik

6/30/2026 · ~2 min read · en

Quick Take

ATHENA-R1 is an AI agent for treatment reasoning that reaches 94.7% accuracy on open-ended drug reasoning and 82.9% on treatment reasoning, 17.8 and 10.7 points above GPT-5 respectively. It is trained by reinforcement learning over a universe of 212 biomedical tools and evaluated on five benchmarks comprising 3,168 drug reasoning tasks and 456 patient treatment cases.
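The reported margins imply approximate GPT-5 scores on the same benchmarks. The short sketch below does that subtraction, assuming "points" means percentage points of accuracy; the implied baselines are derived here and are not stated in the source, and the variable names are illustrative.

```python
# ATHENA-R1 accuracies and margins over GPT-5, as reported in the abstract.
athena_accuracy = {"drug_reasoning": 94.7, "treatment_reasoning": 82.9}
margin_over_gpt5 = {"drug_reasoning": 17.8, "treatment_reasoning": 10.7}


def implied_baseline(score: float, margin: float) -> float:
    """Return the baseline score implied by a score and a margin in points."""
    # Assumption: the margin is a difference in percentage points.
    return round(score - margin, 1)


implied_gpt5 = {
    task: implied_baseline(athena_accuracy[task], margin_over_gpt5[task])
    for task in athena_accuracy
}

for task, score in implied_gpt5.items():
    print(f"{task}: implied GPT-5 accuracy of about {score}%")
```

Under that assumption, the implied GPT-5 accuracies are about 76.9% on drug reasoning and 72.2% on treatment reasoning.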

Key Points

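ATHENA-R1 is trained by reinforcement learning over a universe of 212 biomedical tools and covers all FDA-approved drugs since 1939.
Reached 94.7% accuracy on open-ended drug reasoning and 82.9% on treatment reasoning across five benchmarks.
Outperformed GPT-5 by 17.8 points on drug reasoning and 10.7 points on treatment reasoning.
Preferred over reference models on all criteria in blinded evaluations by experts from 28 rare disease organizations.
Generated adverse-event hypotheses that, tested in electronic health records from 5.4 million patients, reached adjusted odds ratios of 1.48-1.84, with no elevation among negative controls.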
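For readers unfamiliar with adjusted odds ratios, the sketch below shows one common way to adjust for a confounder: the Mantel-Haenszel pooled odds ratio across strata. The source does not say which adjustment method the authors used, so this is a stand-in technique, and every count below is invented for illustration and is not data from the paper.

```python
from typing import List, Tuple

# One 2x2 table per stratum (for example, an age band):
# (exposed with event, exposed without event,
#  unexposed with event, unexposed without event)
Table = Tuple[int, int, int, int]


def mantel_haenszel_odds_ratio(strata: List[Table]) -> float:
    """Pool stratum-level 2x2 tables into a single adjusted odds ratio."""
    numerator = 0.0
    denominator = 0.0
    for a, b, c, d in strata:
        n = a + b + c + d
        numerator += a * d / n
        denominator += b * c / n
    return numerator / denominator


# Hypothetical counts, chosen only to make the example run.
example_strata = [
    (30, 70, 20, 80),
    (45, 155, 30, 170),
]

adjusted_or = mantel_haenszel_odds_ratio(example_strata)
print(f"Mantel-Haenszel adjusted odds ratio: {adjusted_or:.2f}")
```

An odds ratio above 1 means the event is more common among exposed patients. A negative control is an exposure with no expected effect, so an odds ratio near 1 there is the expected result.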

[Submitted on 27 Jun 2026]

Abstract: Treatment reasoning underpins every therapeutic decision, integrating disease context, comorbidities, medications, contraindications, and evolving biomedical knowledge to select an appropriate therapy. It is inherently iterative: candidates are weighed against many constraints, revised as evidence emerges, and grounded in verifiable sources. Here we introduce ATHENA-R1, an AI agent for treatment reasoning across all FDA-approved drugs since 1939, trained by reinforcement learning over a universe of 212 biomedical tools. At each step it identifies missing information, selects and runs relevant tools, and incorporates the evidence. To train it without human-annotated traces, we build a two-level self-learning framework: multi-agent systems construct the tools, tasks, and reasoning trajectories for supervised fine-tuning, then reinforcement learning with scientific feedback rewards reasoning quality (evidence gathering, grounded tool use, logical non-redundancy). Across five benchmarks of 3,168 drug reasoning tasks and 456 patient treatment cases, ATHENA-R1 outperforms language models and tool-use systems, reaching 94.7% accuracy on open-ended drug reasoning and 82.9% on treatment reasoning, 17.8 and 10.7 points above GPT-5. In blinded evaluations by experts from 28 rare disease organizations, it is preferred over reference models on all criteria, and physicians rated it favorably on complex hospitalized cardiovascular and infectious-disease cases. Adverse-event hypotheses it generated, tested in electronic health records from 5.4 million patients, reached adjusted odds ratios of 1.48-1.84, with no elevation among negative controls. Because it requires knowing what evidence to seek before concluding, treatment reasoning has long been hard for AI; we show it can be reframed as a learnable process of iterative evidence gathering that reinforcement learning can train AI to perform.
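The per-step loop described in the abstract (identify missing information, select and run a tool, incorporate the evidence) can be pictured with a toy sketch. This is not the authors' implementation: the tool names, the `TOOLS` registry, and the fixed list of required evidence fields are all hypothetical, and a plain dictionary lookup stands in for the trained policy that chooses tools in the real system.

```python
from typing import Callable, Dict, List, Optional, Tuple


# Hypothetical tools; each returns a placeholder evidence string.
def lookup_contraindications(drug: str) -> str:
    return f"placeholder contraindication record for {drug}"


def lookup_interactions(drug: str) -> str:
    return f"placeholder interaction record for {drug}"


# Hypothetical registry mapping an evidence field to the tool that fills it.
TOOLS: Dict[str, Callable[[str], str]] = {
    "contraindications": lookup_contraindications,
    "interactions": lookup_interactions,
}


def find_missing(required: List[str], evidence: Dict[str, str]) -> Optional[str]:
    """Step 1: identify the first required field that has no evidence yet."""
    for field in required:
        if field not in evidence:
            return field
    return None


def gather_evidence(
    drug: str, required: List[str], max_steps: int = 10
) -> Tuple[Dict[str, str], List[str]]:
    """Run the gather loop until nothing is missing or the step budget ends."""
    evidence: Dict[str, str] = {}
    trace: List[str] = []
    for _ in range(max_steps):
        missing = find_missing(required, evidence)
        if missing is None:
            break  # enough evidence to conclude
        tool = TOOLS.get(missing)  # Step 2: select a relevant tool
        if tool is None:
            evidence[missing] = "unavailable"
            trace.append(f"no tool for {missing}")
            continue
        evidence[missing] = tool(drug)  # Step 3: run it and keep the result
        trace.append(missing)
    return evidence, trace


evidence, trace = gather_evidence(
    "example_drug", ["contraindications", "interactions"]
)
print(trace)
```

Each field is requested at most once, which loosely mirrors the non-redundancy criterion the abstract lists among the rewarded reasoning qualities. The actual reward design is not specified in this listing.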

Comments:	Project page: this https URL Code: this https URL
Subjects:	Artificial Intelligence (cs.AI)
Cite as:	arXiv:2606.28692 [cs.AI]
	(or arXiv:2606.28692v1 [cs.AI] for this version)
	https://doi.org/10.48550/arXiv.2606.28692 arXiv-issued DOI via DataCite

Submission history

From: Shanghua Gao
[v1] Sat, 27 Jun 2026 02:24:56 UTC (8,433 KB)

— Originally published at arxiv.org
