Evoflux: Inference-Time Evolution of Executable Tool Workflows for Compact Agents
Quick Answer
Evoflux raises the execution feasibility of compact language models on held-out MCP-Bench tool workflows from roughly 3% to 17-24%. Under limited teacher-trace budgets it is more reliable than SFT and SFT+DPO, while ReAct reaches higher peaks at higher variance and token cost.
Quick Take
Evoflux is an inference-time evolutionary search method that treats compact tool use as the repair of executable tool workflows. It evolves typed workflow graphs through structured edits and execution feedback.
Key Points
- Evoflux improves execution feasibility for small planners on MCP-Bench tasks.
- Execution feasibility rises from roughly 3% to 17-24% across small planners with Evoflux.
- SFT and SFT+DPO on the same search-mined data match, underperform, or collapse below zero-shot performance.
- ReAct reaches higher peaks, but with higher variance and token cost.
- Evoflux utilizes evolutionary search to repair executable tool workflows.
Article Content
From source RSS / original summary
arXiv:2606.12674v1
Abstract: Compact language models (LMs) reduce cost, latency, and deployment risk for tool agents. Yet MCP-style tool use requires more than isolated tool calls: an agent must discover tools from live catalogs, satisfy schemas, preserve dependencies across intermediate outputs, and ground final responses in executed evidence. Small planners often generate plausible workflow graphs that fail under tool resolution, parameter validation, dependency tracking, or execution.
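To make the failure modes named above concrete, here is a minimal sketch of a pre-execution check covering tool resolution, parameter validation, and dependency tracking. It is an illustration only, not the paper's code: the `Step` class, the `check_workflow` function, and the catalog format (tool name mapped to a set of required parameter names) are all hypothetical simplifications of a real MCP tool schema.

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    """One node in a hypothetical workflow graph."""
    step_id: str
    tool: str
    params: dict
    depends_on: list = field(default_factory=list)


def check_workflow(steps, catalog):
    """Return a list of failure messages; an empty list means the plan passes.

    `catalog` maps tool name -> set of required parameter names
    (an assumed stand-in for a real tool schema).
    """
    failures = []
    seen = set()
    for step in steps:
        if step.tool not in catalog:
            # Tool resolution failure: the planner named a tool that does not exist.
            failures.append(f"{step.step_id}: unresolved tool '{step.tool}'")
        else:
            # Parameter validation failure: required parameters are absent.
            missing = catalog[step.tool] - set(step.params)
            if missing:
                failures.append(f"{step.step_id}: missing params {sorted(missing)}")
        for dep in step.depends_on:
            # Dependency failure: the step consumes output nothing produced earlier.
            if dep not in seen:
                failures.append(f"{step.step_id}: dependency '{dep}' not produced earlier")
        seen.add(step.step_id)
    return failures


catalog = {"search": {"query"}, "summarize": {"text"}}
plan = [
    Step("s1", "search", {"q": "mcp servers"}),        # wrong parameter name
    Step("s2", "summarise", {"text": "$s1"}, ["s1"]),  # tool not in catalog
    Step("s3", "summarize", {"text": "$s9"}, ["s9"]),  # depends on a missing step
]
problems = check_workflow(plan, catalog)
```

Each of the three steps looks plausible on its own, yet each fails a different check, which is the kind of plan the abstract says small planners tend to produce.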
We argue that this failure mode is poorly handled by small-corpus distillation. A few hundred teacher traces can teach workflow format, but rarely cover the recovery behavior needed to repair failed plans over changing tool catalogs. We introduce Evoflux, an inference-time evolutionary search method that treats compact tool use as the repair of executable tool workflows. It evolves typed workflow graphs through structured edits, execution feedback, adaptive intensity, meta-guided redesign, and diversity pruning.
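The loop described above can be sketched as a toy evolutionary repair procedure. This is a rough illustration under stated assumptions, not the Evoflux algorithm: the catalog, the plan encoding, and the functions `failures`, `edit`, and `evolve` are hypothetical. "Adaptive intensity" is approximated here as one edit per current failure and "diversity pruning" as dropping exact duplicates, and meta-guided redesign is omitted entirely.

```python
import random

# Hypothetical catalog: tool name -> required parameter names.
CATALOG = {"search": {"query"}, "fetch": {"url"}, "summarize": {"text"}}


def failures(plan):
    """Execution-feedback stand-in: list (step index, kind) problems in a plan.

    A plan is a tuple of (tool_name, frozenset_of_param_names) steps.
    """
    out = []
    for i, (tool, params) in enumerate(plan):
        if tool not in CATALOG:
            out.append((i, "tool"))
        elif CATALOG[tool] - params:
            out.append((i, "params"))
    return out


def edit(plan, rng):
    """Apply one structured edit aimed at a randomly chosen failing step."""
    probs = failures(plan)
    if not probs:
        return plan
    i, kind = rng.choice(probs)
    steps = list(plan)
    tool, params = steps[i]
    if kind == "tool":
        # Swap in some tool that exists in the catalog (toy choice: random).
        steps[i] = (rng.choice(sorted(CATALOG)), params)
    else:
        # Add the parameters the tool requires.
        steps[i] = (tool, params | CATALOG[tool])
    return tuple(steps)


def evolve(seed_plan, generations=20, pop_size=6, seed=0):
    """Evolve plans until one has no failures or the generation budget runs out."""
    rng = random.Random(seed)
    population = [seed_plan]
    for _ in range(generations):
        children = []
        for parent in population:
            # Adaptive intensity (assumed form): more failures, more edits.
            n_edits = max(1, len(failures(parent)))
            child = parent
            for _ in range(n_edits):
                child = edit(child, rng)
            children.append(child)
        # Diversity pruning (assumed form): drop exact duplicates, keep order.
        pool = list(dict.fromkeys(population + children))
        pool.sort(key=lambda p: len(failures(p)))
        population = pool[:pop_size]
        if not failures(population[0]):
            break
    return population[0]


broken = (("serch", frozenset({"q"})), ("summarize", frozenset()))
best = evolve(broken)
```

In this toy the fitness signal is just the failure count from a validator, and the tool swap only guarantees a tool that resolves, not the one the task needs. A realistic version would run the workflow against live tools and score the executed result.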
On held-out MCP-Bench tasks spanning live MCP servers and 250 tools, Evoflux raises execution feasibility from roughly 3% to 17-24% across small planners. In contrast, SFT and SFT+DPO on the same search-mined data match, underperform, or collapse below zero-shot performance; ReAct reaches higher peaks, but with higher variance and token cost. These results show that execution-grounded search is more reliable under scarce teacher-trace budgets.