SkillDAG: Self-Evolving Typed Skill Graphs for LLM Skill Selection at Scale

arXiv cs.AI·Tong Bai, Zhenglin Wan, Pengfei Zhou, Xingrui Yu, Wangbo Zhao, Yang You, Ivor W. Tsang


6/3/2026 · en

Quick Take

SkillDAG introduces a self-evolving typed skill graph for LLM agents. With MiniMax-M2.7 it reaches 67.1% success and 27.3% reward on ALFWorld and SkillsBench, exceeding the strongest reported Graph-of-Skills baseline by +12.8 and +8.6 points. The gains come from candidate ranking that stays robust as the skill library grows and from online graph edits that raise recall without evicting earlier matches.

Key Points

SkillDAG models inter-skill relationships as a typed directed graph.
Achieved 67.1% success and 27.3% reward on ALFWorld and SkillsBench with MiniMax-M2.7.
Outperformed the strongest reported Graph-of-Skills baseline by +12.8 success points and +8.6 reward points.
Candidate ranking stays robust as the skill pool grows 10x, where a fixed seeding-diffusion pipeline degrades.
Set-monotone online edits increase ground-truth recall without evicting prior hits.

Article Content

From source RSS / original summary

arXiv:2606.03056v1 Announce Type: new

Abstract: As LLM agents adopt large skill libraries, selecting the right subset becomes a structural problem rather than a similarity-matching one: skills depend on, conflict with, specialize, or duplicate one another, a structure invisible to both full enumeration and embedding similarity.
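To make the relation types concrete, here is a minimal Python sketch of a typed skill graph, assuming one edge type per relation the abstract lists (depends on, conflicts with, specializes, duplicates). The `EdgeType` and `SkillGraph` names and the household skill names are hypothetical, not SkillDAG's actual schema. It shows two structural queries that embedding similarity alone cannot answer: the dependency closure of a selection and the conflicts inside it.

```python
from collections import defaultdict
from enum import Enum


class EdgeType(Enum):
    # One hypothetical edge type per relation named in the abstract.
    DEPENDS_ON = "depends_on"
    CONFLICTS_WITH = "conflicts_with"
    SPECIALIZES = "specializes"
    DUPLICATES = "duplicates"


class SkillGraph:
    """Hypothetical typed directed graph over skill names."""

    def __init__(self):
        # out_edges[src][edge_type] -> set of destination skills
        self.out_edges = defaultdict(lambda: defaultdict(set))

    def add_edge(self, src, edge_type, dst):
        self.out_edges[src][edge_type].add(dst)

    def neighbors(self, skill, edge_type):
        # .get avoids creating empty entries for unknown skills
        return set(self.out_edges.get(skill, {}).get(edge_type, set()))

    def dependency_closure(self, skills):
        """Selected skills plus everything reachable via DEPENDS_ON edges."""
        closure, stack = set(skills), list(skills)
        while stack:
            for dep in self.neighbors(stack.pop(), EdgeType.DEPENDS_ON):
                if dep not in closure:
                    closure.add(dep)
                    stack.append(dep)
        return closure

    def conflicts(self, skills):
        """(a, b) pairs inside the selection joined by a CONFLICTS_WITH edge."""
        return {
            (a, b)
            for a in skills
            for b in self.neighbors(a, EdgeType.CONFLICTS_WITH)
            if b in skills
        }


# Usage with made-up household skills.
g = SkillGraph()
g.add_edge("heat_item", EdgeType.DEPENDS_ON, "pick_up_item")
g.add_edge("pick_up_item", EdgeType.DEPENDS_ON, "go_to_location")
g.add_edge("heat_item", EdgeType.CONFLICTS_WITH, "cool_item")

print(sorted(g.dependency_closure({"heat_item"})))
# ['go_to_location', 'heat_item', 'pick_up_item']
print(g.conflicts({"heat_item", "cool_item"}))
# {('heat_item', 'cool_item')}
```

A retriever that only ranks by similarity would return `heat_item` alone and miss its transitive prerequisite `go_to_location`; the typed edges make that dependency explicit.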

We present SkillDAG, which models inter-skill relationships as a typed directed graph and exposes it to an LLM agent as an inference-time, agent-callable structural retrieval interface, queried and evolved during execution rather than baked into a fixed retrieval pipeline: each search returns vector matches, typed-edge neighbors, and conflict signals, and a propose-then-commit protocol lets the agent register execution-backed edges so the graph accumulates structure across episodes.
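The search-and-evolve loop can be sketched as follows. This is a minimal illustration, not the paper's interface: `SkillIndex`, `search`, `propose_edge`, and `commit_edge` are hypothetical names, embeddings are toy 2-D vectors ranked by cosine similarity, and "execution-backed" is reduced to a boolean flag the caller passes after running the skills. A search returns the top-k vector matches, their committed typed-edge neighbors, and any `conflicts_with` pairs among them; a proposed edge only affects later searches once it is committed.

```python
import math
from dataclasses import dataclass, field


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


@dataclass
class SearchResult:
    vector_matches: list  # top-k skills by embedding similarity
    neighbors: dict       # skill -> sorted (edge_type, neighbor) pairs
    conflicts: list       # conflicts_with pairs among matches and their neighbors


@dataclass
class SkillIndex:
    embeddings: dict                             # skill name -> embedding vector
    edges: set = field(default_factory=set)      # committed (src, edge_type, dst)
    pending: dict = field(default_factory=dict)  # proposal id -> proposed edge
    next_id: int = 0

    def search(self, query_vec, k=2):
        """Return vector matches plus typed-edge neighbors and conflict signals."""
        ranked = sorted(self.embeddings,
                        key=lambda s: cosine(query_vec, self.embeddings[s]),
                        reverse=True)
        top = ranked[:k]
        neighbors = {s: sorted((t, d) for (src, t, d) in self.edges if src == s)
                     for s in top}
        expanded = set(top) | {d for pairs in neighbors.values() for _, d in pairs}
        conflicts = sorted((a, b) for (a, t, b) in self.edges
                           if t == "conflicts_with" and a in expanded and b in expanded)
        return SearchResult(top, neighbors, conflicts)

    def propose_edge(self, src, edge_type, dst):
        """Stage an edge; it does not affect search until committed."""
        pid = self.next_id
        self.next_id += 1
        self.pending[pid] = (src, edge_type, dst)
        return pid

    def commit_edge(self, pid, execution_succeeded):
        """Keep a staged edge only if execution backs it; otherwise drop it."""
        edge = self.pending.pop(pid)
        if execution_succeeded:
            self.edges.add(edge)
        return execution_succeeded


idx = SkillIndex(embeddings={
    "heat_item": [1.0, 0.1],
    "pick_up_item": [0.5, 0.5],
    "cool_item": [0.8, 0.6],
    "go_to_location": [0.0, 1.0],
})
idx.edges |= {("heat_item", "depends_on", "pick_up_item"),
              ("heat_item", "conflicts_with", "cool_item")}

res = idx.search([1.0, 0.0], k=2)
print(res.vector_matches)  # ['heat_item', 'cool_item']
print(res.conflicts)       # [('heat_item', 'cool_item')]

# After an episode in which the dependency was observed, register it.
pid = idx.propose_edge("pick_up_item", "depends_on", "go_to_location")
idx.commit_edge(pid, execution_succeeded=True)
```

Here pure similarity ranks two mutually conflicting skills at the top, and the conflict signal is what lets the agent avoid selecting both. Staging proposals separately keeps unverified edges out of retrieval, so only edges with an execution outcome behind them accumulate across episodes.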

On ALFWorld and SkillsBench with MiniMax-M2.7, SkillDAG reaches 67.1% success and 27.3% reward, exceeding the strongest reported Graph-of-Skills baseline by +12.8 and +8.6 points; the advantage ports to gpt-5.2-codex, and intrinsic SkillsBench Ret@K rises from 65.5 to 78.2 under matched queries.
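The abstract does not define Ret@K. A common reading, assumed here, is the fraction of a query's ground-truth skills that appear in the top K retrieved candidates, averaged over queries and reported as a percentage. The sketch below computes that; `ret_at_k` and `mean_ret_at_k` are hypothetical helpers. "Matched queries" presumably means both systems are scored on the same query set, so the same `runs` structure would be built once per system and compared.

```python
def ret_at_k(ranked, relevant, k):
    """Share of ground-truth skills that appear among the top-k retrieved ones."""
    relevant = set(relevant)
    if not relevant:
        raise ValueError("each query needs at least one ground-truth skill")
    hits = len(set(ranked[:k]) & relevant)
    return hits / len(relevant)


def mean_ret_at_k(runs, k):
    """Average Ret@K over (ranked, relevant) query pairs, as a percentage."""
    scores = [ret_at_k(ranked, relevant, k) for ranked, relevant in runs]
    return 100.0 * sum(scores) / len(scores)


# Two toy queries: the first finds 1 of 2 ground-truth skills in its top 2,
# the second finds its only ground-truth skill.
runs = [
    (["heat_item", "cool_item", "pick_up_item"], {"heat_item", "pick_up_item"}),
    (["open_fridge", "go_to_location", "clean_item"], {"go_to_location"}),
]
print(mean_ret_at_k(runs, k=2))  # 75.0
```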

These gains trace to isolable mechanisms: candidate ranking that stays robust as the pool grows 10x where a fixed seeding-diffusion pipeline degrades, and set-monotone online edits that enlarge ground-truth recall without evicting prior hits.
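Why add-only edits cannot lower recall can be shown with a small example, assuming (hypothetically) that retrieval returns the seed matches plus their one-hop neighbors over committed edges. Under that assumption retrieval is monotone in the edge set: adding edges can only enlarge the returned set, so every earlier hit survives and recall against a fixed ground truth cannot drop. `retrieve` and `recall` are illustrative names, not SkillDAG's code.

```python
def retrieve(seeds, edges):
    """Hypothetical retrieval: seed matches plus one-hop neighbors over committed edges."""
    return set(seeds) | {dst for (src, _edge_type, dst) in edges if src in seeds}


def recall(retrieved, ground_truth):
    """Fraction of ground-truth skills present in the retrieved set."""
    return len(retrieved & ground_truth) / len(ground_truth)


seeds = {"heat_item"}
ground_truth = {"heat_item", "pick_up_item", "go_to_location"}

edges_before = {("heat_item", "depends_on", "pick_up_item")}
# An online edit only adds an edge; nothing is deleted or rewritten.
edges_after = edges_before | {("heat_item", "depends_on", "go_to_location")}

before = retrieve(seeds, edges_before)
after = retrieve(seeds, edges_after)

assert before <= after  # every earlier hit is still retrieved
print(recall(before, ground_truth), recall(after, ground_truth))  # recall rises from 2/3 to 1.0
```

The guarantee depends on the retrieval rule: if neighbor lists were truncated to a fixed top-n, a new edge could push out an old neighbor and the property would no longer hold.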


