INFRAMIND: Infrastructure-Aware Multi-Agent Orchestration
Quick Answer
INFRAMIND introduces an infrastructure-aware multi-agent orchestration framework that optimizes model selection and scheduling based on real-time system load.
Quick Take
Across five benchmarks, INFRAMIND gains up to 7.6 percentage points of accuracy over the prior baseline at low load, with up to 7x lower latency. Under high load it sustains up to 99.9% SLO compliance, where every baseline drops below 50%.
Key Points
- INFRAMIND adapts model selection to dynamic infrastructure signals such as queue depths, KV-cache pressure, and latencies.
- The framework is cast as a hierarchical constrained MDP and trained end-to-end with reinforcement learning to balance quality against latency.
- Across five benchmarks, it gains up to 7.6 percentage points of accuracy over the prior baseline at low load, with up to 7x lower latency.
- It sustains up to 99.9% SLO compliance under high load, where every baseline drops below 50%.
- It favors simpler topologies during congestion and reorders each model's queue so that urgent requests are served first.
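The urgency-first reordering in the last point can be pictured with a small priority queue. This is only a minimal sketch: the summary does not say how urgency is measured, so the `slack` field (seconds of budget a request has left) and the `Request` and `reorder_by_urgency` names are hypothetical.

```python
import heapq
from dataclasses import dataclass, field


@dataclass(order=True)
class Request:
    # Hypothetical urgency measure: seconds of budget left (smaller = more urgent).
    slack: float
    request_id: str = field(compare=False)


def reorder_by_urgency(requests):
    """Return request ids so that the request with the least slack comes first."""
    heap = list(requests)
    heapq.heapify(heap)
    return [heapq.heappop(heap).request_id for _ in range(len(heap))]


# Illustrative queue for one model; the numbers are made up.
queue = [Request(4.0, "a"), Request(0.5, "b"), Request(2.0, "c")]
order = reorder_by_urgency(queue)
print(order)
```

A learned scheduler would presumably weigh more than a single slack value; the heap only illustrates the idea of serving the tightest budget first.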
Article Content
From source RSS / original summary
arXiv:2606.11440v1
Abstract: Existing multi-agent LLM orchestration methods, ranging from brute-force ensembles to learned routers, select models and topologies based on task and model features. However, these methods do not consider the runtime state of the serving infrastructure. On shared GPU clusters under concurrent load, this infrastructure blindness causes systematic resource underutilization: preferred models accumulate deep request queues while equally capable alternatives sit idle.
In multi-agent pipelines, where each query triggers multiple sequential model calls, these delays then compound across every downstream step. Closing this gap is challenging because the relevant infrastructure signals (queue depths, KV-cache pressure, latencies) are dynamic and noisy, and they must drive three different decisions: planning, per-step routing, and scheduling. We introduce INFRAMIND, a framework that makes the entire multi-agent stack infrastructure-aware.
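The compounding effect is easy to see with a little arithmetic. The sketch below assumes a strictly sequential pipeline in which each step pays a queue wait plus a service time; the function name and all numbers are illustrative and do not come from the paper.

```python
def pipeline_latency(steps):
    """Total latency of a sequential pipeline.

    `steps` is a list of (queue_wait_s, service_s) pairs, one per agent step.
    """
    return sum(wait + service for wait, service in steps)


# Four sequential model calls, each taking 1 s of service time (made-up values).
congested = [(3.0, 1.0)] * 4  # every call waits 3 s behind a deep queue
idle = [(0.0, 1.0)] * 4       # the same calls routed to an idle model

congested_total = pipeline_latency(congested)
idle_total = pipeline_latency(idle)
print(congested_total, idle_total)
```

With these numbers a 3-second wait per call adds 12 seconds to a four-step query, which is why a routing decision that ignores queues costs more in a pipeline than in a single call.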
An infra-aware planner conditions topology and role selection on real-time system load and remaining budget, biasing toward simpler graphs under congestion and richer ones at low load. An infra-aware executor then observes per-model queue depths, cache utilization, and response latencies at each agent step to decide which model to call and how deeply to reason; a budget-aware scheduler further reorders each model's queue so that urgent requests are served first.
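To make the executor's per-step decision concrete, here is a hand-written scoring heuristic over the three signals named above. It is not the paper's method, which learns this policy with reinforcement learning. The `ModelState` fields, the wait estimate, and `latency_weight` are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class ModelState:
    name: str
    quality: float            # assumed expected answer quality in [0, 1]
    queue_depth: int          # requests currently waiting for this model
    cache_utilization: float  # fraction of KV cache in use, in [0, 1]
    recent_latency_s: float   # recent per-request response latency


def estimated_wait(m):
    # Crude assumption: each queued request costs one recent latency,
    # and cache pressure inflates that cost.
    return (m.queue_depth + 1) * m.recent_latency_s * (1.0 + m.cache_utilization)


def pick_model(models, latency_weight):
    """Choose the model with the best quality-minus-latency score."""
    return max(models, key=lambda m: m.quality - latency_weight * estimated_wait(m))


models = [
    ModelState("large-model", 0.9, queue_depth=20, cache_utilization=0.9, recent_latency_s=2.0),
    ModelState("small-model", 0.8, queue_depth=0, cache_utilization=0.1, recent_latency_s=1.0),
]
print(pick_model(models, latency_weight=0.01).name)  # infrastructure-aware choice
print(pick_model(models, latency_weight=0.0).name)   # infrastructure-blind choice
```

Setting `latency_weight` to zero reproduces the infrastructure-blind behavior: the preferred model is always chosen, however deep its queue.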
Cast as a hierarchical constrained MDP and solved end-to-end via reinforcement learning, the system learns to balance quality against latency automatically. Across five benchmarks, INFRAMIND delivers up to +7.6 pp accuracy over the prior baseline at low load with up to 7x lower latency, and sustains up to 99.9% SLO compliance under high load where every baseline drops below 50%.
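For readers unfamiliar with constrained MDPs, the generic textbook form is shown below. The abstract does not give INFRAMIND's exact formulation, so the symbols are assumptions: $r_t$ stands for a quality reward, $c_t$ for a latency cost, $B$ for a budget, and $\lambda$ for a Lagrange multiplier.

```latex
% Generic constrained MDP: maximize expected reward subject to a cost budget.
\max_{\pi} \; \mathbb{E}_{\pi}\Big[\sum_{t} r_t\Big]
\quad \text{s.t.} \quad
\mathbb{E}_{\pi}\Big[\sum_{t} c_t\Big] \le B

% A common Lagrangian relaxation of the same problem.
\min_{\lambda \ge 0} \; \max_{\pi} \;
\mathbb{E}_{\pi}\Big[\sum_{t} \big(r_t - \lambda\, c_t\big)\Big] + \lambda B
```

In the relaxed form, $\lambda$ acts as the price of latency: it rises while the budget is violated and falls when there is slack, which is one standard way a policy can learn to trade quality against latency.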