MedEvoEval: Evaluating Continual Evolution of Doctor Agents through Simulated Clinical Episodes

·~2 min·6/30/2026·en·0

Quick Answer

MedEvoEval introduces a longitudinal evaluation framework for doctor agents, enabling assessment of their evolving clinical decision-making across simulated outpatient episodes.

Quick Take

MedEvoEval's episode traces expose process costs that final-answer scoring hides and show how MDT-style consultation reallocates resources. They also support longitudinal analyses of memory maturation, held-out transfer, and backward retention, giving a concrete basis for evaluating whether doctor agents improve through experience and retain earlier capabilities over time.

Key Points

MedEvoEval evaluates doctor agents across simulated outpatient episodes.
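The released artifact includes 700 processed episodes, and each episode records a structured trace linking observations, actions, final outputs, and manager scores.
Episode traces expose process costs that final-answer scoring hides, and they support longitudinal analyses of memory maturation, held-out transfer, update-stage response, and backward retention.
The framework gives a concrete basis for evaluating whether doctor agents improve through experience, transfer useful behavior, and retain earlier capabilities.
Experiments also show how MDT-style consultation reallocates resources.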

[Submitted on 27 Jun 2026]

Abstract: Doctor agents are moving beyond single-turn answer generation toward evolving clinical decision systems. Within an outpatient episode, they acquire evidence, use examination and consultation resources, and decide when to finalize a diagnosis and management plan. Across episodes, their behavior may change through memory, retrieval, reflection, or other update mechanisms. Current evaluations only partially cover this setting. Fixed-input medical QA benchmarks score final answers from complete inputs, whereas many interactive benchmarks still focus on individual encounters or fixed runs, providing limited support for evaluating how episode-level decisions interact with cross-episode experience. We introduce MedEvoEval, an executable longitudinal evaluation framework based on action-gated simulated outpatient episodes. Each source case is converted into role-specific patient, examination, and manager views; evidence is revealed only through valid actions; and each episode records a structured trace that links observations, actions, final outputs, manager scores, and optional experience write-back. We release a runnable E&D artifact with 700 processed episodes, provenance notes, schemas, an episode runner, scoring scripts, configurations, example logs, analysis code, and trajectory- and step-level derivatives. Experiments show that episode traces expose process costs hidden by final-answer scoring, show how MDT-style consultation reallocates resources, and support longitudinal analyses of memory maturation, held-out transfer, update-stage response, and backward retention. Together, these results show that MedEvoEval provides a concrete basis for evaluating whether doctor agents improve through experience, transfer useful behavior, and retain earlier capabilities over time.
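
The action-gated episode described in the abstract can be pictured with a small sketch. This is not the released MedEvoEval runner or its schema: the class names (Step, Episode), the action strings, the toy case, and the summary fields are all hypothetical, chosen only to show how evidence can be revealed solely through valid actions while every step is logged to a trace.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Step:
    action: str        # action the agent issued
    valid: bool        # whether the gate accepted it
    observation: str   # evidence revealed (empty if the action was invalid)


@dataclass
class Episode:
    # Hypothetical gate: maps each valid action to the evidence it unlocks.
    hidden_evidence: dict
    reference_diagnosis: str
    trace: list = field(default_factory=list)
    final_output: Optional[str] = None

    def act(self, action: str) -> str:
        """Reveal evidence only for a valid action; log the step either way."""
        valid = action in self.hidden_evidence
        observation = self.hidden_evidence[action] if valid else ""
        self.trace.append(Step(action, valid, observation))
        return observation

    def finalize(self, diagnosis: str) -> dict:
        """Record the final output and return a toy manager-style summary."""
        self.final_output = diagnosis
        return {
            "correct": diagnosis == self.reference_diagnosis,
            "steps": len(self.trace),
            "invalid_steps": sum(not s.valid for s in self.trace),
        }


# Made-up case for illustration only; not taken from the released episodes.
episode = Episode(
    hidden_evidence={
        "ask_symptoms": "fever and productive cough for 3 days",
        "order_chest_xray": "right lower lobe consolidation",
    },
    reference_diagnosis="pneumonia",
)
episode.act("ask_symptoms")
episode.act("order_mri")         # not offered by this case, so nothing is revealed
episode.act("order_chest_xray")
summary = episode.finalize("pneumonia")
```

In this toy run the final answer is correct, yet the trace still records three steps, one of them rejected by the gate. That is the kind of process cost a final-answer score alone would not show.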

Comments:	31 pages, including appendices
Subjects:	Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Cite as:	arXiv:2606.28900 [cs.AI]
	(or arXiv:2606.28900v1 [cs.AI] for this version)
	https://doi.org/10.48550/arXiv.2606.28900 arXiv-issued DOI via DataCite

Submission history

From: Hui Zhang
[v1] Sat, 27 Jun 2026 13:14:16 UTC (8,078 KB)

Originally published at arxiv.org
