CHI-Bench: Can AI Agents Automate End-to-End, Long-Horizon, Policy-Rich Healthcare Workflows?

arXiv cs.CL·Haolin Chen, Deon Metelski, Leon Qi, Tao Xia, Joonyul Lee, Steve Brown, Kevin Riley, Frank Wang, T. Y. Alvin Liu, Hank Capps MD, Zeyu Tang, Xiangchen Song, Lingjing Kong, Fan Feng, Tianyi Zeng, Zhiwei Liu, Zixian Ma, Hang Jiang, Fangli Geng, Yuan Yuan, Chenyu You, Qingsong Wen, Hua Wei, Yanjie Fu, Yue Zhao, Carl Yang, Biwei Huang, Kun Zhang, Caiming Xiong, Sanmi Koyejo, Eric P. Xing, Philip S. Yu, Weiran Yao

1d ago

·~2 min·5/19/2026·en·3

Quick Take

CHI-Bench benchmarks AI's ability to automate complex healthcare workflows with significant policy and role challenges.

Key Points

Focuses on policy density and multi-role composition.
Evaluates tasks in provider, payer, and care management domains.
Current best performance resolves only 28% of tasks.

📖 Reader Mode

~2 min read

[Submitted on 15 May 2026]

Authors:Haolin Chen, Deon Metelski, Leon Qi, Tao Xia, Joonyul Lee, Steve Brown, Kevin Riley, Frank Wang, T. Y. Alvin Liu, Hank Capps MD, Zeyu Tang, Xiangchen Song, Lingjing Kong, Fan Feng, Tianyi Zeng, Zhiwei Liu, Zixian Ma, Hang Jiang, Fangli Geng, Yuan Yuan, Chenyu You, Qingsong Wen, Hua Wei, Yanjie Fu, Yue Zhao, Carl Yang, Biwei Huang, Kun Zhang, Caiming Xiong, Sanmi Koyejo, Eric P. Xing, Philip S. Yu, Weiran Yao

View PDF HTML (experimental)

Abstract:End-to-end automation of realistic healthcare operations stresses three capabilities underrepresented in current benchmarks: policy density, decisions must be grounded in a large library of medical, insurance, and operational rules; Multi-role composition: a single task requires the agent to play multiple roles with handoffs; and multilateral interaction: intermediate workflow steps are multi-turn dialogs, such as peer-to-peer review and patient outreach. We introduce $\chi$-Bench, a benchmark of long-horizon healthcare workflows across three domains: provider prior authorization, payer utilization management, and care management. Each task hands the agent a clinical case in a high-fidelity simulator of 20 healthcare apps exposed via 87 MCP tools, which it must drive to a terminal status through tool calls and writing the role's artifacts, guided by a 1,290+ document managed-care operations handbook skill. Across 30 agent harness/models configurations, the best agent resolves only 28.0% of tasks, no agent clears 20% on strict pass^3, and executing all tasks in a single session slumps the performance to 3.8%. These results raise the hypothesis that similar gaps are likely to surface in other policy-dense, role-composed, irreversible enterprise domains.

Comments:	Website: this https URL Code: this https URL Dataset: this https URL
Subjects:	Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Cite as:	arXiv:2605.16679 [cs.CL]
	(or arXiv:2605.16679v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2605.16679 arXiv-issued DOI via DataCite (pending registration)

Submission history

From: Haolin Chen [view email]
[v1] Fri, 15 May 2026 22:34:31 UTC (6,764 KB)

— Originally published at arxiv.org

Continue reading on arxiv.org

CHI-Bench: Can AI Agents Automate End-to-End, Long-Horizon, Policy-Rich Healthcare Workflows?

Quick Take

Key Points

📖 Reader Mode

Submission history

More from arXiv cs.CL

Time to REFLECT: Can We Trust LLM Judges for Evidence-based Research Agents?

Diagnosing Multi-step Reasoning Failures in Black-box LLMs via Stepwise Confidence Attribution

MMoA: An AI-Agent framework with recurrence for Memoried Mixure-of-Agent

Related in this space

Verifiable Agentic Infrastructure: Proof-Derived Authorization for Sovereign AI Systems

Learn-by-Wire Training Control Governance: Bounded Autonomous Training Under Stress for Stability and Efficiency