MAVEN: Improving Generalization in Agentic Tool Calling

arXiv cs.AI·Omkar Ghugarkar, Vishvesh Bhat, Muhammad Ahmed Mohsin, Asad Aali

6/1/2026


Quick Answer

This paper shows that MAVEN (Modular Agentic Verification and Execution Network) enhances reasoning in agentic tool-calling environments, improving GPT-OSS-120b accuracy from 48% to 71% on MAVEN-Bench without extra training.

Quick Take

MAVEN is a lightweight, training-free scaffold. It remains competitive with frontier proprietary models while running on an open-weight backbone at an estimated cost ratio of roughly 1/10, which suggests that verification-centered scaffolds can strengthen compositional reasoning.

Key Points

MAVEN introduces a symbolic reasoning scaffold for structured decomposition and adaptive tool orchestration.
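MAVEN is evaluated on established tool-calling benchmarks (BFCL v3, TauBench, Tau2Bench, AceBench) and on the newly introduced MAVEN-Bench, a stress test for multi-step mathematical and physical reasoning.
MAVEN-Bench exposes a substantial gap between partial reasoning quality and end-to-end task success.
On MAVEN-Bench, MAVEN raises the accuracy of its GPT-OSS-120b base model by 23 percentage points (48% to 71%) without additional training.
MAVEN's estimated cost ratio of roughly 1/10 relative to proprietary baselines motivates more process-aware evaluation of agents.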
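
The 48% to 71% change can be read two ways: as an absolute gain in percentage points or as a relative improvement over the baseline. The short Python sketch below computes both from the two reported figures; the function name `accuracy_gain` is chosen here for illustration and does not come from the paper.

```python
def accuracy_gain(before: float, after: float) -> tuple[float, float]:
    """Return (absolute gain in percentage points, relative gain in percent)."""
    absolute_points = after - before
    relative_percent = (after - before) / before * 100.0
    return absolute_points, relative_percent


# Reported MAVEN-Bench accuracies for the GPT-OSS-120b base model and for MAVEN.
points, relative = accuracy_gain(48.0, 71.0)
print(f"absolute gain: {points:.0f} percentage points")
print(f"relative gain: {relative:.1f}% over the baseline")
```

The absolute gain is 23 percentage points, while the relative gain is close to 48%, so the two readings should not be used interchangeably.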


Article Content

From source RSS / original summary

arXiv:2605.30738v1 · Announce Type: new

Abstract: Generalization across agentic tool-calling environments remains a central challenge for reliable agentic reasoning systems. Although large language models achieve strong results on individual benchmarks, their ability to compose reasoning strategies, preserve intermediate states, and coordinate tools across domains remains underexplored.

We present MAVEN (Modular Agentic Verification and Execution Network), a lightweight symbolic reasoning scaffold for structured decomposition, adaptive tool orchestration, and intermediate verification. We evaluate MAVEN across established tool-calling benchmarks, including BFCL v3, TauBench, Tau2Bench, and AceBench, and introduce MAVEN-Bench, a stress-test benchmark for multi-step mathematical and physical reasoning with explicit verification and adversarial task composition.
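
The abstract does not describe MAVEN's internals, so the following is only a minimal Python sketch of the general pattern it names: a task decomposed into steps, each step dispatched to a tool, and each intermediate result verified before the next step runs. Everything here is a hypothetical illustration rather than the authors' implementation, including `Step`, `TOOLS`, `run_plan`, `VerificationError`, and the `"$k"` convention for referring to the result of step k.

```python
from dataclasses import dataclass
from typing import Callable


class VerificationError(Exception):
    """Raised when an intermediate result fails its check (hypothetical)."""


@dataclass
class Step:
    tool: str                        # key into the TOOLS registry
    args: tuple                      # literals, or "$k" to reuse the result of step k
    check: Callable[[float], bool]   # verifier applied to this step's result


# Hypothetical tool registry; a real agent would call external tools here.
TOOLS = {
    "multiply": lambda a, b: a * b,
    "divide": lambda a, b: a / b,
}


def run_plan(steps):
    """Execute steps in order, verifying each intermediate result."""
    results = []
    for index, step in enumerate(steps):
        # Resolve "$k" references to previously computed results.
        args = [
            results[int(a[1:])] if isinstance(a, str) and a.startswith("$") else a
            for a in step.args
        ]
        value = TOOLS[step.tool](*args)
        if not step.check(value):
            raise VerificationError(f"step {index} ({step.tool}) failed its check")
        results.append(value)
    return results


def non_negative(value):
    return value >= 0


# Toy decomposition: kinetic energy 0.5 * m * v^2 with m = 2 kg, v = 3 m/s.
plan = [
    Step("multiply", (3, 3), non_negative),      # v^2
    Step("multiply", (2, "$0"), non_negative),   # m * v^2
    Step("divide", ("$1", 2), non_negative),     # divide by 2
]
print(run_plan(plan))
```

Stopping at the first failed check means an early error cannot silently propagate into later steps, which is one plausible reason a verification-centered scaffold would help on multi-step tasks.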

MAVEN-Bench exposes a substantial gap between partial reasoning quality and end-to-end task success; in direct MAVEN-Bench runs, MAVEN improves its GPT-OSS-120b base model from 48% to 71% accuracy without additional training.
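
One way to see how partial reasoning quality and end-to-end success can diverge is to score the same traces both ways. The Python sketch below uses made-up toy traces (not data from the paper) in which each task is a list of per-step correctness flags; the metric names `step_accuracy` and `end_to_end_success` are assumptions for illustration and are not MAVEN-Bench's actual scoring functions.

```python
def step_accuracy(traces):
    """Fraction of individual steps that are correct, pooled over all tasks."""
    total_steps = sum(len(trace) for trace in traces)
    correct_steps = sum(sum(trace) for trace in traces)
    return correct_steps / total_steps


def end_to_end_success(traces):
    """Fraction of tasks in which every step is correct."""
    return sum(all(trace) for trace in traces) / len(traces)


# Toy traces: four tasks with four steps each, True meaning the step was correct.
traces = [
    [True, True, True, True],
    [True, True, True, False],
    [True, False, True, True],
    [True, True, True, True],
]

print(f"step accuracy:      {step_accuracy(traces):.3f}")
print(f"end-to-end success: {end_to_end_success(traces):.3f}")
```

In this toy data 14 of 16 steps are correct (0.875), yet only 2 of 4 tasks succeed end to end (0.5), because a single wrong step fails the whole task.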

It also remains competitive with frontier proprietary baselines while using an open-weight backbone with an estimated cost ratio of roughly 1/10, suggesting that lightweight verification-centered scaffolds can strengthen compositional reasoning and motivate more process-aware evaluation of agents in the wild.

