Less Context, Better Agents: Efficient Context Engineering for Long-Horizon Tool-Using LLM Agents
Quick Answer
In a study of automated expense itemization using GPT-5 in Microsoft Dynamics 365, pruning the context to recent tool calls and adding summarization raised complete itemization to 91.6% while cutting token usage and runtime by more than 60% relative to full-context retention.
Quick Take
The paper evaluates four GPT-5 configurations for automated expense itemization in Microsoft Dynamics 365 Finance and Operations. Pruning the context to recent tool calls and adding summarization raised complete itemization to 91.6%, up from 71.0% with full conversation history. The best configuration used 553,374 tokens and 5.79 hours per benchmark, versus 1,480,996 tokens and 14.56 hours for full-context retention, suggesting that efficient context engineering can improve both reliability and efficiency in enterprise workflows.
Key Points
- No-user-model baseline achieved only 8.0% complete itemization.
- Full-context retention improved completion to 71.0%, consuming 1,480,996 tokens.
- Pruning to the last 5 tool calls raised completion to 79.0% with 535,274 tokens.
- Pruning plus summarization gave the best result: 91.6% complete itemization and 99.64% average amount itemized, with 553,374 tokens.
- Results indicate selective retention enhances reliability and efficiency in enterprise workflows.
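The relative savings implied by the figures above can be checked with a few lines of arithmetic. The sketch below uses only the token counts and completion rates listed in the key points; the dictionary layout and the helper name `reduction_pct` are illustrative choices, not anything taken from the paper.

```python
# Token counts and complete-itemization rates as listed in the key points.
# The configuration labels are illustrative, not the paper's own names.
RESULTS = {
    "full_context": {"tokens": 1_480_996, "complete_pct": 71.0},
    "pruned_last_5": {"tokens": 535_274, "complete_pct": 79.0},
    "pruned_plus_summary": {"tokens": 553_374, "complete_pct": 91.6},
}


def reduction_pct(baseline: float, value: float) -> float:
    """Percentage by which `value` is lower than `baseline`."""
    return 100.0 * (1.0 - value / baseline)


def compare_to_full(name: str) -> dict:
    """Token reduction and completion gain of one configuration vs. full context."""
    base = RESULTS["full_context"]
    cfg = RESULTS[name]
    return {
        "token_reduction_pct": reduction_pct(base["tokens"], cfg["tokens"]),
        "completion_gain_points": cfg["complete_pct"] - base["complete_pct"],
    }


if __name__ == "__main__":
    for name in ("pruned_last_5", "pruned_plus_summary"):
        stats = compare_to_full(name)
        print(
            f"{name}: {stats['token_reduction_pct']:.1f}% fewer tokens, "
            f"{stats['completion_gain_points']:+.1f} points complete itemization"
        )
```

Under these numbers, pruning alone uses roughly 63.9% fewer tokens than full-context retention, and pruning with summarization uses roughly 62.6% fewer while gaining about 20.6 percentage points in complete itemization.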
Article Content
From source RSS / original summary (arXiv:2606.10209v1, announce type: new)
Abstract: Large language models deployed as autonomous agents for enterprise workflows face a key challenge: verbose tool responses from enterprise systems can cause context overflow, stale-state errors, and high inference cost. We study this problem in automated expense itemization in Microsoft Dynamics 365 Finance and Operations using tools.
We evaluate four GPT-5 configurations on a 50-task hotel expense benchmark: no user model, full conversation history, context pruned to the last 5 tool call/response pairs, and pruning with automated summarization. Results are averaged across 5 independent runs, with the user model held constant for the context-engineering comparison. The no-user-model baseline achieves only 8.0% complete itemization. Full-context retention improves completion to 71.0%, but consumes 1,480,996 tokens and 14.56 hours per benchmark. Pruning to the last 5 tool calls improves completion to 79.0% while reducing token use to 535,274 and runtime to 5.39 hours. Adding summarization achieves the best result: 91.6% complete itemization and 99.64% average amount itemized, with 553,374 tokens and 5.79 hours.
We further report confidence intervals, effect-size analysis, sensitivity over pruning and summary windows, failure analysis, results across five expense types grouped into three categories, and cross-model evidence with Claude Sonnet 4.5. These results show that, for this class of enterprise workflow, selective retention of recent tool interactions plus compact summarization can improve both reliability and efficiency compared with full-history retention.
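The abstract describes the strategy only at a high level: keep the last 5 tool call/response pairs and summarize the rest. A minimal sketch of that idea might look like the following. Everything here is an assumption made for illustration: the `ToolExchange` type, the `build_context` and `naive_summarize` helpers, and the message layout are hypothetical, and the paper's actual summarizer, prompts, and tool schema are not given in this summary. The placeholder summarizer simply truncates responses, where a real system would presumably call a model.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class ToolExchange:
    """One tool call and its (possibly verbose) response. Hypothetical type."""
    call: str
    response: str


def split_history(
    history: List[ToolExchange], keep_last: int
) -> Tuple[List[ToolExchange], List[ToolExchange]]:
    """Split history into (older, recent), where recent has at most keep_last items."""
    if keep_last <= 0:
        return list(history), []
    return history[:-keep_last], history[-keep_last:]


def naive_summarize(exchanges: List[ToolExchange], max_chars: int = 40) -> str:
    """Placeholder summarizer: keeps each call and a truncated response.

    This stands in for an automated summarization step; it is not the
    paper's method.
    """
    return "; ".join(f"{ex.call} -> {ex.response[:max_chars]}" for ex in exchanges)


def build_context(
    system_prompt: str,
    history: List[ToolExchange],
    keep_last: int = 5,
    summarize: Optional[Callable[[List[ToolExchange]], str]] = None,
) -> List[str]:
    """Assemble the messages sent to the model for the next step.

    Older exchanges are dropped (pruning only) or replaced by one summary
    message (pruning with summarization); recent ones are kept verbatim.
    """
    older, recent = split_history(history, keep_last)
    messages = [system_prompt]
    if older and summarize is not None:
        messages.append("Summary of earlier tool activity: " + summarize(older))
    for ex in recent:
        messages.append(f"CALL: {ex.call}")
        messages.append(f"RESPONSE: {ex.response}")
    return messages


if __name__ == "__main__":
    history = [ToolExchange(f"tool_{i}()", f"response {i} " * 20) for i in range(8)]
    pruned = build_context("You itemize expenses.", history, keep_last=5)
    summarized = build_context(
        "You itemize expenses.", history, keep_last=5, summarize=naive_summarize
    )
    print(len(pruned), len(summarized))
```

Passing `summarize=None` gives the pruning-only variant, so the two context-engineering configurations from the abstract differ by a single argument in this sketch. The window size is a parameter because the abstract mentions sensitivity analysis over pruning and summary windows.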