LemonHarness Technical Report
Quick Answer
LemonHarness is an execution framework for long-horizon LLM agents that establishes explicit execution boundaries and integrates a reusable rule knowledge base. On Terminal-Bench 2.0 it reached 84.49% accuracy with GPT-5.3-CodeX and 86.52% with the stronger GPT-5.5 backbone.
Quick Take
Beyond the benchmark numbers, LemonHarness targets workspace state management and time-aware execution, both of which matter for tasks that run across many iterations.
Key Points
- LemonHarness constrains state-changing operations within a defined workspace.
- Reached 84.49% accuracy over 445 trials with GPT-5.3-CodeX on Terminal-Bench 2.0.
- Reached 86.52% average accuracy across five jobs with the stronger GPT-5.5 backbone.
- Adds a time-aware execution mechanism that exposes elapsed and remaining budget to the model.
- Utilizes a reusable rule knowledge base for consistent execution criteria.
Article Content
Source: RSS / original summary, arXiv:2606.24311v1 (announce type: new)
Abstract: As large language model (LLM) agents are applied to longer tasks, they increasingly modify workspace state across multiple rounds of iteration. However, agents typically observe only tool outputs and log fragments, while the actual state changes occur in the file system. Without explicit workspace boundaries, state-changing operations such as file writes and temporary artifact generation may scatter changes across paths.
Over time, these weakly constrained changes accumulate, making states such as modified files difficult to track. This paper presents LemonHarness, an integrated execution framework for long-horizon agents. LemonHarness establishes an explicit execution boundary by constraining state-changing operations within a clearly defined workspace and bringing model invocation, tool execution, and rule knowledge within a single controlled boundary.
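The boundary idea can be illustrated with a small path check. This is only a sketch under stated assumptions, not the paper's implementation: the names `resolve_inside`, `write_file`, and `WorkspaceBoundaryError` are hypothetical, and the check covers file paths only (it is not a sandbox).

```python
import tempfile
from pathlib import Path


class WorkspaceBoundaryError(Exception):
    """Raised when a write targets a path outside the workspace (hypothetical name)."""


def resolve_inside(root: Path, relative_path: str) -> Path:
    """Resolve a path and require that it lies strictly inside the workspace root."""
    root = root.resolve()
    # resolve() collapses ".." segments and follows symlinks before the check
    target = (root / relative_path).resolve()
    if root not in target.parents:
        raise WorkspaceBoundaryError(f"{relative_path!r} is outside the workspace")
    return target


def write_file(root: Path, relative_path: str, content: str) -> dict:
    """Write a file inside the workspace and return a structured result."""
    try:
        target = resolve_inside(root, relative_path)
    except WorkspaceBoundaryError as err:
        return {"tool": "write_file", "ok": False, "error": str(err)}
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(content, encoding="utf-8")
    rel = target.relative_to(root.resolve()).as_posix()
    return {"tool": "write_file", "ok": True, "path": rel}


# Usage: one write inside the workspace, two attempts to escape it
with tempfile.TemporaryDirectory() as tmp:
    workspace = Path(tmp)
    inside = write_file(workspace, "notes/plan.txt", "step 1")
    escaped = write_file(workspace, "../escaped.txt", "oops")
    absolute = write_file(workspace, "/tmp/elsewhere.txt", "oops")
```

Rejected writes come back as structured results rather than exceptions, so a caller could hand them to the model as feedback instead of aborting the run.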
State-changing operations, including file writes, dependency installation, and temporary artifact creation, are executed through structured tool interfaces, with execution feedback recorded as observations available to subsequent model decisions. The system also introduces a reusable rule knowledge base, which turns recurring execution rules and acceptance criteria into runtime knowledge.
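One way to picture a rule knowledge base that is callable at runtime is a small registry of rules, each pairing guidance text with an acceptance check. The abstract does not describe the actual data model, so everything here (`Rule`, `RuleBase`, the tag scheme, and the example rules) is a hypothetical illustration.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Rule:
    name: str
    applies_to: frozenset            # hypothetical task tags, e.g. {"python"}
    guidance: str                    # text that could be shown to the model
    accept: Callable[[dict], bool]   # acceptance criterion over a result summary


class RuleBase:
    def __init__(self):
        self._rules = []

    def add(self, rule: Rule) -> None:
        self._rules.append(rule)

    def lookup(self, tags) -> list:
        """Return the rules whose tags overlap the task's tags."""
        tags = set(tags)
        return [r for r in self._rules if r.applies_to & tags]

    def check(self, tags, result: dict) -> dict:
        """Evaluate each applicable acceptance criterion against a result."""
        return {r.name: r.accept(result) for r in self.lookup(tags)}


# Usage with made-up rules and a made-up result summary
rules = RuleBase()
rules.add(Rule("tests-pass", frozenset({"python"}),
               "Run the test suite before finishing.",
               lambda r: r.get("tests_failed", 1) == 0))
rules.add(Rule("no-stray-files", frozenset({"python", "shell"}),
               "Keep all artifacts inside the workspace.",
               lambda r: not r.get("files_outside_workspace")))
rules.add(Rule("binary-built", frozenset({"c"}),
               "Confirm the build produced a binary.",
               lambda r: r.get("binary_exists", False)))

report = rules.check({"python"}, {"tests_failed": 0, "files_outside_workspace": []})
```

Keeping the criteria as data means the same rule can be reused across tasks that share a tag rather than restated in every prompt.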
LemonHarness further adds a time-aware execution mechanism that exposes elapsed and remaining budget to the model, so it can rebalance exploration, implementation, and validation effort as time pressure shifts and avoid timeouts from long waits or excessive verification. On Terminal-Bench 2.0, LemonHarness_GPT-5.3-CodeX reached 84.49% accuracy over 445 trials; pairing the same framework with the stronger GPT-5.5 backbone raised the average accuracy to 86.52% across five jobs.
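A time budget of this kind could be tracked along the following lines. This is a minimal sketch, assuming a single wall-clock budget per task; the `TimeBudget` class, the status-line format, and the 10% validation reserve are illustrative assumptions, not details from the paper.

```python
import time


class TimeBudget:
    """Tracks elapsed and remaining time for one task (hypothetical helper)."""

    def __init__(self, total_seconds, clock=time.monotonic):
        self.total = float(total_seconds)
        self._clock = clock          # injectable so the budget can be tested
        self._start = clock()

    def elapsed(self) -> float:
        return self._clock() - self._start

    def remaining(self) -> float:
        return max(0.0, self.total - self.elapsed())

    def status_line(self) -> str:
        """Text that could be appended to each observation shown to the model."""
        return (f"[budget] elapsed={self.elapsed():.0f}s "
                f"remaining={self.remaining():.0f}s of {self.total:.0f}s")

    def tool_timeout(self, requested, reserve_fraction=0.1) -> float:
        """Cap a tool call so a reserve stays available for final validation."""
        usable = max(0.0, self.remaining() - reserve_fraction * self.total)
        return min(float(requested), usable)


# Usage with a fake clock: 450 s into a 600 s budget
now = [0.0]
budget = TimeBudget(600, clock=lambda: now[0])
now[0] = 450.0
line = budget.status_line()
capped = budget.tool_timeout(300)
```

Capping each tool call against the remaining budget is one simple way to avoid the long waits mentioned above; how the model itself rebalances exploration and validation is left to its own decisions.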
The results suggest that a unified runtime boundary, callable rule knowledge, and time-aware execution can improve the stability of long-horizon agent execution.