SDOF: Taming the Alignment Tax in Multi-Agent Orchestration with State-Constrained Dispatch

5/18/2026


Quick Take

SDOF treats multi-agent execution as a constrained state machine. It reaches 86.5% end-to-end task completion, and its 7B Intent Router beats zero-shot GPT-4o in joint routing accuracy (80.9% vs. 48.9%). The framework pairs an Online-RLHF Specialized Intent Router with a StateAwareDispatcher for auditable execution control. It was evaluated on 185 expert-curated scenarios that triggered 1671 live API calls against a recruitment system backed by the Beisen iTalent platform (6000+ enterprises).

Key Points

SDOF treats multi-agent execution as a constrained state machine.
Reached 86.5% end-to-end task completion (95% confidence interval 80.8 to 90.7).
The 7B Intent Router scored 80.9% joint accuracy on FSM-constrained adversarial routing, versus 48.9% for zero-shot GPT-4o.
Evaluated on 185 expert-curated scenarios that triggered 1671 live API calls in a recruitment system.
Reached 100% precision and 88% recall in a message-level blocking audit.
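
The completion figure can be sanity-checked with a short calculation. The sketch below assumes that 160 of the 185 scenarios completed (a count inferred from 86.5%, not stated in the abstract) and uses a Wilson score interval; the abstract does not say which interval method the authors used, so this only shows that the reported range is consistent with that assumption.

```python
import math

def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    p = successes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half_width = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return center - half_width, center + half_width

# Assumption: 160 completed scenarios out of 185 (inferred, not from the paper).
completed, total = 160, 185
low, high = wilson_interval(completed, total)
print(f"{completed / total:.1%} ({low:.1%} to {high:.1%})")
# prints "86.5% (80.8% to 90.7%)"
```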


[Submitted on 20 Apr 2026]

Abstract: Multi-agent orchestration frameworks such as LangChain, LangGraph, and CrewAI route tasks through graph-based pipelines but do not enforce the stage constraints that govern real business processes. We present SDOF, a framework that treats multi-agent execution as a constrained state machine. SDOF operates through two primary defensive layers, implemented by three components: (1) an Online-RLHF Specialized Intent Router trained via Generative Reward Modeling (GRPO) and (2) a StateAwareDispatcher with GoalStage finite-automaton checks and precondition/postcondition SkillRegistry validation for auditable execution control. On a recruitment system backed by the Beisen iTalent platform (6000+ enterprises), 185 expert-curated scenarios trigger 1671 live API calls. Our GSPO-aligned 7B Intent Router achieves higher joint accuracy than zero-shot GPT-4o on this FSM-constrained adversarial routing benchmark (80.9% versus 48.9%). In end-to-end execution, SDOF reaches 86.5% task completion (95% confidence interval 80.8 to 90.7) and blocks all 22 operations in the injection, illegal HR subset. Under a broader message-level blocking audit, SDOF attains precision 100% and recall 88%, expert agreement kappa=0.94. A separate evaluation on 960 SGD-derived dialogues spanning 8 service domains surfaces 201 stage-order conflicts under our FSM mapping, 41 of which arise in the normal split. This arXiv version reports the current validated scope; extended multi-seed training comparisons and deeper workflow evaluations will be released in a subsequent update.
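
To make the dispatch idea concrete, here is a minimal sketch of a stage-gated dispatcher with precondition and postcondition checks. Only the names StateAwareDispatcher, GoalStage, and SkillRegistry come from the abstract; the stages, transition table, skill names, state keys, and audit format are all hypothetical, and this is not the authors' implementation.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, Dict, List, Set, Tuple

State = Dict[str, object]

class GoalStage(Enum):  # hypothetical stages, not taken from the paper
    SCREENING = 1
    INTERVIEW = 2
    OFFER = 3

# Hypothetical finite automaton: stages reachable from each current stage.
ALLOWED: Dict[GoalStage, Set[GoalStage]] = {
    GoalStage.SCREENING: {GoalStage.SCREENING, GoalStage.INTERVIEW},
    GoalStage.INTERVIEW: {GoalStage.INTERVIEW, GoalStage.OFFER},
    GoalStage.OFFER: {GoalStage.OFFER},
}

@dataclass
class Skill:  # one entry of an assumed skill registry
    stage: GoalStage
    pre: Callable[[State], bool]
    run: Callable[[State], State]
    post: Callable[[State], bool]

@dataclass
class StateAwareDispatcher:
    registry: Dict[str, Skill]
    stage: GoalStage = GoalStage.SCREENING
    audit: List[Tuple[str, str]] = field(default_factory=list)

    def dispatch(self, name: str, state: State) -> State:
        skill = self.registry.get(name)
        if skill is None:
            return self._block(name, "unknown skill", state)
        if skill.stage not in ALLOWED[self.stage]:
            return self._block(name, "illegal stage transition", state)
        if not skill.pre(state):
            return self._block(name, "precondition failed", state)
        new_state = skill.run(dict(state))  # run on a copy so a failure leaves state intact
        if not skill.post(new_state):
            return self._block(name, "postcondition failed", state)
        self.stage = skill.stage
        self.audit.append((name, "executed"))
        return new_state

    def _block(self, name: str, reason: str, state: State) -> State:
        self.audit.append((name, f"blocked: {reason}"))
        return state

registry = {
    "schedule_interview": Skill(
        GoalStage.INTERVIEW,
        pre=lambda s: s.get("resume_passed") is True,
        run=lambda s: {**s, "interview_scheduled": True},
        post=lambda s: s.get("interview_scheduled") is True,
    ),
    "send_offer": Skill(
        GoalStage.OFFER,
        pre=lambda s: s.get("interview_passed") is True,
        run=lambda s: {**s, "offer_sent": True},
        post=lambda s: s.get("offer_sent") is True,
    ),
}

dispatcher = StateAwareDispatcher(registry)
state: State = {"resume_passed": True}
state = dispatcher.dispatch("send_offer", state)          # blocked: OFFER is not reachable from SCREENING
state = dispatcher.dispatch("schedule_interview", state)  # executed, stage advances to INTERVIEW
```

In this sketch every request leaves an audit entry whether it runs or is blocked, which is one simple way to get the kind of auditable execution control the abstract describes.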

Comments:	12 pages, 4 figures, 14 tables
Subjects:	Artificial Intelligence (cs.AI)
ACM classes:	I.2.11; H.4.1
Cite as:	arXiv:2605.15204 [cs.AI]
	(or arXiv:2605.15204v1 [cs.AI] for this version)
	https://doi.org/10.48550/arXiv.2605.15204 arXiv-issued DOI via DataCite

Submission history

From: Zhantao Wang
[v1] Mon, 20 Apr 2026 12:51:39 UTC (1,651 KB)

Originally published at arxiv.org
