A Single Rewrite Suffices: Empirical Lessons from Production Skill Description Optimization

arXiv cs.CL·Yangqiaoyu Zhou, Mohammad Alqudah, Kwei-Herng Lai, Aaron Halfaker, Yingqi Xiong, Yaar Harari

2h ago

·~2 min·7/1/2026·en·0

Quick Answer

This paper shows that An automated description optimization pipeline for enterprise AI agents reduced engineering effort from 120 minutes to 3.8 minutes while achieving F1 scores of 79.2%, comparable to manually tuned descriptions.

Quick Take

An automated description optimization pipeline for enterprise AI agents reduced engineering effort from 120 minutes to 3.8 minutes while achieving F1 scores of 79.2%, comparable to manually tuned descriptions. The key improvement driver was a single LLM rewrite utilizing false-positive and false-negative cases, highlighting the importance of addressing skill collisions in overlapping descriptions.

Key Points

Automated pipeline achieved 79.2% F1, close to 79.4% of manual tuning.
Engineering effort reduced from 120 minutes to 3.8 minutes, a 32x speedup.
Single LLM rewrite significantly improved routing accuracy by utilizing error cases.
Other design choices had minimal impact on final F1 scores, under 0.5%.
Skill collisions remain unresolved when intended scopes genuinely overlap.

Paper Resources

Read Paperarxiv.org View PDFarxiv.org

Article Content

From source RSS / original summary

arXiv:2606. 30775v1 Announce Type: new Abstract: Enterprise AI agents route user queries to specialized skills by matching queries against natural language skill descriptions. When two skills share overlapping descriptions, the routing LLM misroutes queries, a failure we term skill collision. As agents scale to dozens of skills, manually tuning descriptions to maintain routing accuracy becomes a significant engineering bottleneck.

We deploy an automated description optimization pipeline on a production enterprise group chat agent (9 skills, 372 regression cases). The pipeline produces descriptions averaging 79. 2% F1, matching manually tuned descriptions at 79. 4% F1 (average per-skill difference -0. 20%, within the 0. 78% multi-seed noise floor), while reducing per-skill engineering effort from 120 minutes to 3. 8 minutes (32 times speedup). We then examine which pipeline components actually drive this match.

Systematic ablation on both the production system and ToolBench (16k tools) reveals that a single LLM rewrite using any available false-positive and false-negative cases captures most of the available improvement. Other design choices we tested (iteration budget, feedback signal composition, dual editing of confused pairs, and training set size) each affect final F1 by less than 0. 5%.

Description optimization addresses skill collisions caused by overlapping descriptions but cannot resolve cases where two skills intended scopes genuinely overlap. We identify a diagnostic (a large train-validation F1 gap) that flags the latter cases for architectural rather than text-level intervention.

Read on arxiv.org

Want this in your inbox every morning?

Daily brief at your local 8am — bilingual EN/中文, free.

Subscribe — it's free

More from arXiv cs.CL

See more →

arXiv cs.CL·Barak Or

1w ago

FeaturedOriginal

Quantifying Prior Dominance in Systems

AI Summary

The study introduces the Normalized Context Utilization (NCU) metric to evaluate Retrieval-Augmented Generation (RAG) systems, revealing that Small Language Models (SLMs) outperform larger models in factual extraction. The findings indicate that traditional scaling laws yield diminishing returns, with a commercial API frequently failing against adversarial evidence due to systemic confidence collapse.

#LLM #AI Coding #Inference #AI Startup

A Single Rewrite Suffices: Empirical Lessons from Production Skill Description Optimization

Quick Answer

Quick Take

Key Points

Paper Resources

Article Content

Want this in your inbox every morning?

More from arXiv cs.CL

Quantifying Prior Dominance in Systems

Time to REFLECT: Can We Trust LLM Judges for Evidence-based Research Agents?

When Plausible Is Not Realistic: Evaluating Human Mobility in LLM-Based Urban Simulation

Quick Answer

Quick Take

Key Points

Paper Resources

Article Content

Want this in your inbox every morning?

More from arXiv cs.CL

Quantifying Prior Dominance in RAG Systems

Time to REFLECT: Can We Trust LLM Judges for Evidence-based Research Agents?

When Plausible Is Not Realistic: Evaluating Human Mobility in LLM-Based Urban Simulation

Quantifying Prior Dominance in Systems