A Single Rewrite Suffices: Empirical Lessons from Production Skill Description Optimization
Quick Answer
This paper shows that an automated description optimization pipeline for enterprise AI agents reduced per-skill engineering effort from 120 minutes to 3.8 minutes while achieving an average F1 of 79.2%, comparable to manually tuned descriptions.
Quick Take
An automated description optimization pipeline for enterprise AI agents reduced engineering effort from 120 minutes to 3.8 minutes while achieving F1 scores of 79.2%, comparable to manually tuned descriptions. The key improvement driver was a single LLM rewrite utilizing false-positive and false-negative cases, highlighting the importance of addressing skill collisions in overlapping descriptions.
Key Points
- The automated pipeline averaged 79.2% F1, close to the 79.4% of manually tuned descriptions and within the 0.78% multi-seed noise floor.
- Per-skill engineering effort fell from 120 minutes to 3.8 minutes, a 32x speedup.
- A single LLM rewrite using false-positive and false-negative cases captured most of the available improvement.
- The other design choices tested each changed final F1 by less than 0.5%.
- Skill collisions remain unresolved when intended scopes genuinely overlap.
Article Content
From source RSS / original summary: arXiv:2606.30775v1 (announce type: new).
Abstract: Enterprise AI agents route user queries to specialized skills by matching queries against natural language skill descriptions. When two skills share overlapping descriptions, the routing LLM misroutes queries, a failure we term skill collision. As agents scale to dozens of skills, manually tuning descriptions to maintain routing accuracy becomes a significant engineering bottleneck.
We deploy an automated description optimization pipeline on a production enterprise group chat agent (9 skills, 372 regression cases). The pipeline produces descriptions averaging 79.2% F1, matching manually tuned descriptions at 79.4% F1 (average per-skill difference -0.20%, within the 0.78% multi-seed noise floor), while reducing per-skill engineering effort from 120 minutes to 3.8 minutes (a 32x speedup). We then examine which pipeline components actually drive this match.
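The comparison above rests on per-skill F1 and on a noise-floor check. The following is a minimal sketch of how such numbers could be computed, assuming each regression case has exactly one expected skill and the router returns exactly one skill; the abstract does not state the exact scoring protocol, and the names `per_skill_f1`, `macro_f1`, and `within_noise` are hypothetical.

```python
from collections import Counter


def per_skill_f1(expected, predicted):
    """Return {skill: F1}, treating each skill as a one-vs-rest problem.

    Assumption: single-label routing (one expected and one predicted
    skill per query). The source does not confirm this setup.
    """
    tp, fp, fn = Counter(), Counter(), Counter()
    for gold, pred in zip(expected, predicted):
        if gold == pred:
            tp[gold] += 1
        else:
            fp[pred] += 1  # query wrongly routed to `pred`
            fn[gold] += 1  # query that `gold` should have received
    scores = {}
    for skill in set(expected) | set(predicted):
        denom = 2 * tp[skill] + fp[skill] + fn[skill]
        scores[skill] = 2 * tp[skill] / denom if denom else 0.0
    return scores


def macro_f1(scores):
    """Unweighted mean of the per-skill F1 values."""
    return sum(scores.values()) / len(scores)


def within_noise(auto_f1, manual_f1, noise_floor):
    """True when the gap between two F1 values is no larger than the noise floor."""
    return abs(auto_f1 - manual_f1) <= noise_floor


# Figures quoted in the abstract, in percentage points.
matches_manual = within_noise(79.2, 79.4, 0.78)
speedup = 120 / 3.8
```

With the reported figures, the 0.2-point gap sits inside the 0.78-point noise floor, and 120 / 3.8 is roughly 31.6, which the abstract rounds to 32.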
Systematic ablation on both the production system and ToolBench (16k tools) reveals that a single LLM rewrite using any available false-positive and false-negative cases captures most of the available improvement. Other design choices we tested (iteration budget, feedback signal composition, dual editing of confused pairs, and training set size) each affect final F1 by less than 0.5%.
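A single rewrite pass of this kind might look like the sketch below. It is illustrative only: the prompt wording, the helper names (`collect_errors`, `build_rewrite_prompt`, `rewrite_once`), and the `llm` callable are all assumptions, not the authors' implementation. The `llm` argument stands in for whatever text-completion function is available.

```python
def collect_errors(skill, cases, predictions):
    """Split routing mistakes for `skill` into false positives and false negatives.

    `cases` is a list of (query, expected_skill) pairs and `predictions`
    holds the router's chosen skill for each case, in the same order.
    """
    false_positives, false_negatives = [], []
    for (query, gold), pred in zip(cases, predictions):
        if pred == skill and gold != skill:
            false_positives.append(query)
        elif gold == skill and pred != skill:
            false_negatives.append(query)
    return false_positives, false_negatives


def build_rewrite_prompt(skill, description, false_positives, false_negatives):
    """Assemble a rewrite request. The wording here is hypothetical."""
    lines = [
        f"Rewrite the description of the skill '{skill}' so a router selects it correctly.",
        f"Current description: {description}",
        "Queries wrongly routed to this skill (false positives):",
        *[f"- {q}" for q in false_positives],
        "Queries this skill should have received but did not (false negatives):",
        *[f"- {q}" for q in false_negatives],
        "Return only the new description.",
    ]
    return "\n".join(lines)


def rewrite_once(skill, description, cases, predictions, llm):
    """Run one rewrite pass; keep the description unchanged if there are no errors."""
    fps, fns = collect_errors(skill, cases, predictions)
    if not fps and not fns:
        return description
    return llm(build_rewrite_prompt(skill, description, fps, fns)).strip()
```

Passing the model as a plain callable keeps the sketch independent of any particular LLM client and makes the single pass easy to test with a stub.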
Description optimization addresses skill collisions caused by overlapping descriptions but cannot resolve cases where two skills' intended scopes genuinely overlap. We identify a diagnostic (a large train-validation F1 gap) that flags the latter cases for architectural rather than text-level intervention.
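The diagnostic can be sketched as a simple per-skill gap check. The abstract does not say how large a gap counts as "large", so the threshold below is a caller-supplied assumption, and the function name `flag_for_architecture_review` is hypothetical.

```python
def flag_for_architecture_review(train_f1, val_f1, gap_threshold):
    """Return {skill: gap} for skills whose train F1 exceeds validation F1
    by more than `gap_threshold`.

    `train_f1` and `val_f1` map skill names to F1 values on the same scale.
    `gap_threshold` is an assumed cutoff; the source gives no value.
    """
    flagged = {}
    for skill, train_score in train_f1.items():
        if skill not in val_f1:
            continue  # no validation score, so no gap can be measured
        gap = train_score - val_f1[skill]
        if gap > gap_threshold:
            flagged[skill] = gap
    return flagged


# Illustrative values only, not results from the paper.
flagged = flag_for_architecture_review(
    train_f1={"calendar": 0.95, "rooms": 0.80},
    val_f1={"calendar": 0.70, "rooms": 0.79},
    gap_threshold=0.10,
)
```

In this made-up example only `calendar` is flagged, which under the abstract's reading would suggest its scope overlaps another skill and calls for an architectural change rather than another description rewrite.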