Scaling Enterprise Agent Routing: Degradation, Diagnosis, and Recovery
Quick Answer
This paper shows that routing accuracy for enterprise LLM assistants degrades as the tool catalog grows from 10 to 110 agents: F1 on under-specified requests drops by 16-23 percentage points.
Quick Take
Routing accuracy for enterprise LLM assistants degrades as tool catalogs expand, with F1 dropping 16-23 percentage points on under-specified requests. Embedding-based shortlisting recovers 10-11 points at full scale across three models. A human annotation study on real traffic confirms the effect with a 10-17 point recovery, although absolute performance there is 10-15 points lower.
Key Points
- Routing F1 scores drop 16-23 points on under-specified requests across models.
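- Oracle analysis splits the degradation into a retrieval gap (the model cannot surface the right tool) and a confusion gap (the oracle ceiling drops 10 points even with perfect retrieval).
- Embedding-based shortlisting recovers 10-11 F1 points at full scale across all three models.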
- Human annotation study confirms recovery of 10-17 points in real traffic.
- Absolute performance on real traffic is 10-15 points lower, even with the recovery.
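To make the shortlisting idea concrete, here is a minimal sketch of narrowing a tool catalog to the top-k candidates by similarity before a model picks a route. The summary does not say which embedding model or similarity measure the authors use, so `embed` below is a toy bag-of-words stand-in, and the catalog, tool names, and `k` are hypothetical.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector.

    Assumption: a real system would call a learned embedding model here.
    """
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def shortlist(request: str, catalog: dict[str, str], k: int = 3) -> list[str]:
    """Return the names of the k tools whose descriptions best match the request."""
    query = embed(request)
    scored = [(cosine(query, embed(desc)), name) for name, desc in catalog.items()]
    # Highest similarity first; ties are broken by tool name for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in scored[:k]]


# Hypothetical four-tool catalog for illustration only.
CATALOG = {
    "calendar.create_event": "create a calendar event or schedule a meeting",
    "email.send": "send an email message to a recipient",
    "files.search": "search documents and files by keyword",
    "expenses.submit": "submit an expense report for reimbursement",
}

candidates = shortlist("schedule a meeting with the design team", CATALOG, k=2)
print(candidates)
```

Only the shortlisted tools would then be shown to the routing model, which shrinks the space in which it can confuse similar tools. The trade-off is that a tool missing from the shortlist can no longer be selected, so `k` has to balance recall against prompt size.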
Article Excerpt
From source RSS / original summary
arXiv:2606.17519v1 Announce Type: new
Abstract: Production LLM assistants route user requests to growing libraries of specialized tools, but how does routing accuracy degrade as the catalog scales? We study single-step routing on a 110-agent, 584-tool catalog from a deployed enterprise productivity assistant, evaluating three frontier models from 10 to 110 agents. Routing F1 on under-specified requests drops 16-23 percentage points across models.
An oracle analysis decomposes the degradation into a retrieval gap (the model cannot surface the right tool) and a confusion gap (even with perfect retrieval, the oracle ceiling drops 10pp). Embedding-based shortlisting recovers +10-11pp F1 at full scale across all three models and two providers. A production annotation study (1,435 human-labeled utterances, three annotators) confirms the recovery on real traffic at +10-17pp despite 10-15pp lower absolute performance.
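The results above are stated as routing F1 in percentage points (pp). As a rough illustration of the metric, the sketch below computes a per-request F1 between predicted and gold tool sets and averages it. The excerpt does not specify how the authors aggregate F1 (per request, micro, or macro), so the set-based, per-request averaging here is an assumption, and the example tool names are hypothetical.

```python
def routing_f1(predicted: set[str], gold: set[str]) -> float:
    """F1 between the predicted and gold tool sets for a single request."""
    if not predicted and not gold:
        return 1.0  # nothing expected and nothing routed counts as correct
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


def mean_f1(pairs: list[tuple[set[str], set[str]]]) -> float:
    """Average per-request F1 (assumed aggregation, not confirmed by the source)."""
    return sum(routing_f1(pred, gold) for pred, gold in pairs) / len(pairs)


def drop_in_pp(f1_before: float, f1_after: float) -> float:
    """Difference between two F1 scores on a 0-1 scale, in percentage points."""
    return (f1_before - f1_after) * 100


# Hypothetical predictions for three requests, not data from the paper.
examples = [
    ({"email.send"}, {"email.send"}),                  # exact match
    ({"email.send", "files.search"}, {"email.send"}),  # one spurious tool
    ({"files.search"}, {"email.send"}),                # wrong tool
]
print(round(mean_f1(examples), 3))
```

A percentage-point change is an absolute difference between two scores, so a fall from 0.80 to 0.60 F1 is a 20pp drop rather than a 25% relative decline.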