How Consistent Are LLM Agents? Measuring Behavioral Reproducibility in Multi-Step Tool-Calling Pipelines
Quick Take
This study investigates the behavioral reproducibility of large language model (LLM) agents in multi-step tool-calling scenarios, reporting significant variability in tool selection and arguments across repeated runs of identical tasks. Unlike previous research on ReAct-style agents with search-only, free-text actions, this work targets structured tool-calling interfaces with typed parameters and consequential side effects.
Key Points
- LLM agents show inconsistent behavior across identical task repetitions.
- The study focuses on structured tool-calling interfaces, unlike earlier studies of free-text actions.
- Behavioral consistency is critical for deploying LLMs in production systems.
- Research highlights the need for improved reliability in multi-step tool-calling.
- Findings could impact the design of future LLM applications.
Article Excerpt
From source RSS / original summary
arXiv:2605.28840v1 Announce Type: new
Abstract: Large language model (LLM) agents with tool-calling capabilities are increasingly deployed in production systems, yet a fundamental reliability question remains under-explored: does the same agent behave the same way twice? We present a systematic empirical study of behavioral consistency in multi-step tool-calling agents, measuring whether agents select the same tools, in the same order, with the same arguments, across repeated identical invocations.
Unlike prior work on consistency in ReAct-style agents (search-only, free-text actions), we study the richer setting of structured tool-calling interfaces with typed parameters and consequential side effects.
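The three questions in the abstract (same tools, same order, same arguments) can be illustrated with a minimal sketch. It is not the paper's metric: the pairwise-agreement definition, the trace shape (a list of `(tool_name, arguments)` pairs), and the names `pairwise_agreement` and `consistency_report` are all assumptions made for illustration.

```python
import json
from itertools import combinations

# Assumption: one run is a list of (tool_name, arguments) pairs.
# This trace shape is hypothetical, not taken from the paper.


def _canonical(args):
    # Serialize arguments with sorted keys so dict ordering does not matter.
    return json.dumps(args, sort_keys=True)


def pairwise_agreement(runs, key):
    """Fraction of run pairs for which key(run) is identical."""
    pairs = list(combinations(runs, 2))
    if not pairs:
        return 1.0  # zero or one run: trivially consistent
    return sum(key(a) == key(b) for a, b in pairs) / len(pairs)


def consistency_report(runs):
    """Agreement at three levels of strictness (illustrative definitions)."""
    return {
        # Same set of tools, ignoring order and arguments.
        "same_tools": pairwise_agreement(
            runs, lambda r: frozenset(name for name, _ in r)
        ),
        # Same tools in the same order, ignoring arguments.
        "same_order": pairwise_agreement(
            runs, lambda r: tuple(name for name, _ in r)
        ),
        # Same tools, same order, same arguments.
        "same_args": pairwise_agreement(
            runs, lambda r: tuple((name, _canonical(a)) for name, a in r)
        ),
    }


# Three made-up runs of the "same" task.
runs = [
    [("search", {"q": "flights"}), ("book", {"id": 7})],
    [("search", {"q": "flights"}), ("book", {"id": 8})],  # different argument
    [("book", {"id": 7}), ("search", {"q": "flights"})],  # different order
]
report = consistency_report(runs)
print(report)
```

In this toy data every run uses the same set of tools, only one of the three run pairs shares the same order, and no pair matches on arguments, so the three scores separate cleanly. Exact-match comparison of arguments is a deliberately strict choice; a real evaluation might normalize semantically equivalent values first.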