Scaling Enterprise Agent Routing: Degradation, Diagnosis, and Recovery

arXiv cs.CL·Kellen Gillespie, Robyn Perry

6/17/2026

·~1 min·6/17/2026·en·3

Quick Answer

This paper shows that Routing accuracy for enterprise LLM assistants degrades significantly as tool catalogs expand, with F1 scores dropping by 16-23 points on under-specified requests.

Quick Take

An embedding-based shortlisting method recovers 10-11 points across three models, confirmed by a human annotation study showing a 10-17 point recovery in real traffic despite lower absolute performance.

Key Points

Routing F1 scores drop 16-23 points on under-specified requests across models.
Oracle analysis reveals retrieval and confusion gaps contributing to performance degradation.
Embedding-based shortlisting recovers 10-11 points F1 at full scale across all models.
Human annotation study confirms recovery of 10-17 points in real traffic.
Performance remains 10-15 points lower in absolute terms despite recovery.

Paper Resources

Read Paperarxiv.org View PDFarxiv.org

Source Excerpt

arXiv:2606. 17519v1 Announce Type: new Abstract: Production assistants route user requests to growing libraries of specialized tools, but how does routing accuracy degrade as the catalog scales? We study single-step routing on a 110-agent, 584-tool catalog from a deployed enterprise productivity assistant, evaluating three frontier models from 10 to 110 agents. Routing F1 on under-specified requests drops 16--23 percentage points across models.

An oracle analysis decomposes the degradation into a \emph{retrieval} gap (the model cannot surface the right tool) and a \emph{confusion} gap (even with perfect retrieval, the oracle ceiling drops 10pp). …

Read on arxiv.org

Want this in your inbox every morning?

Daily brief at your local 8am — bilingual EN/中文, free.

Subscribe — it's free

More from arXiv cs.CL

See more →

arXiv cs.CL·Isabel Xu (The Overlake School), Cynthia Xu (The Overlake School), Rachel Ren (Edwards Vacuum Inc.), Cong Guo (The University of Memphis), Jiacheng Ding (The University of Memphis)

1w ago

FeaturedOriginal

TriAgent: Divergence-Aware Committees for Cost-Efficient Financial Sentiment Analysis

AI Summary

TriAgent introduces a cost-efficient multi-agent system for financial sentiment analysis, combining VADER, FinBERT, and Qwen2.5. It achieves an F1 score of ~0.87 with significant savings of $9.3M/year at a 10M-user scale compared to GPT-4o-mini, while also detecting hallucinations with an AUC of 0.90.

#LLM #Agent #AI Startup #Enterprise AI

Scaling Enterprise Agent Routing: Degradation, Diagnosis, and Recovery

Quick Answer

Quick Take

Key Points

Paper Resources

Source Excerpt

Want this in your inbox every morning?

More from arXiv cs.CL

TriAgent: Divergence-Aware Committees for Cost-Efficient Financial Sentiment Analysis

RF-Agent: A Practical Framework for Building Language Agents for RFIC Design

Letting the Data Speak: Extracting Keywords from Crowdsourced Collections with AI

Quick Answer

Quick Take

Key Points

Paper Resources

Source Excerpt

Want this in your inbox every morning?

More from arXiv cs.CL

TriAgent: Divergence-Aware Multi-Agent Committees for Cost-Efficient Financial Sentiment Analysis

RF-Agent: A Practical Framework for Building Language Agents for RFIC Design

Letting the Data Speak: Extracting Keywords from Crowdsourced Collections with AI

TriAgent: Divergence-Aware Committees for Cost-Efficient Financial Sentiment Analysis