LemonHarness Technical Report

arXiv cs.AI·Kailong Ren, Fubo Sun, Jiachen Liu, Liu Yang, Zimo Yin, Jiaying Li, Congli Yin, Ming He, Yu Huo, Jiawei Liu, Zeping Chen, Yubin Huangfu, Ronghua Li, Yixuan Wu, Xing Su, Yanzhi Xu, Likang Wu, Hongke Zhao, Lei Zhang, Xiaohui Geng, Jianping Fan

4h ago

·~2 min·6/24/2026·en·0

Quick Answer

LemonHarness is a new execution framework for long-horizon LLM agents, improving accuracy from 84.49% to 86.52% on Terminal-Bench 2.0 by establishing explicit execution boundaries and integrating a reusable rule knowledge base.

Quick Take

LemonHarness is a new execution framework for long-horizon LLM agents, improving accuracy from 84.49% to 86.52% on 2.0 by establishing explicit execution boundaries and integrating a reusable rule knowledge base. This framework enhances state management and time-aware execution, crucial for tasks requiring multiple iterations.

Key Points

LemonHarness constrains state-changing operations within a defined workspace.
Achieved 84.49% accuracy with GPT-5.3-CodeX on Terminal-Bench 2.0.
Accuracy improved to 86.52% using the stronger GPT-5.5 model.
Introduces a time-aware execution mechanism to optimize resource allocation.
Utilizes a reusable rule knowledge base for consistent execution criteria.

Paper Resources

Read Paperarxiv.org View PDFarxiv.org

Article Content

From source RSS / original summary

arXiv:2606. 24311v1 Announce Type: new Abstract: As large language model (LLM) agents are applied to longer tasks, they increasingly modify workspace state across multiple rounds of iteration. However, agents typically observe only tool outputs and log fragments, while the actual state changes occur in the file system. Without explicit workspace boundaries, state-changing operations such as file writes and temporary artifact generation may scatter changes across paths.

Over time, these weakly constrained changes accumulate, making states such as modified files difficult to track. This paper presents LemonHarness, an integrated execution framework for long-horizon agents. LemonHarness establishes an explicit execution boundary by constraining state-changing operations within a clearly defined workspace and bringing model invocation, tool execution, and rule knowledge within a single controlled boundary.

State-changing operations, including file writes, dependency installation, and temporary artifact creation, are executed through structured tool interfaces, with execution feedback recorded as observations available to subsequent model decisions. The system also introduces a reusable rule knowledge base, which turns recurring execution rules and acceptance criteria into runtime knowledge.

LemonHarness further adds a time-aware execution mechanism that exposes elapsed and remaining budget to the model, so it can rebalance exploration, implementation, and validation effort as time pressure shifts and avoid timeouts from long waits or excessive verification. On 2. 0, LemonHarness_GPT-5. 3-CodeX reached 84. 49% accuracy over 445 trials; pairing the same framework with the stronger GPT-5. 5 backbone raised the average accuracy to 86. 52% across five jobs.

The results suggest that a unified runtime boundary, callable rule knowledge, and time-aware execution can improve the stability of long-horizon agent execution.

Read on arxiv.org

Want this in your inbox every morning?

Daily brief at your local 8am — bilingual EN/中文, free.

Subscribe — it's free

More from arXiv cs.AI

See more →

arXiv cs.AI·Neha Prakriya, Chaojun Hou, Zheng Gong, Huasha Zhao, Xi Zhao, Mou Li, Zhenyu Gu, Emad Barsoum

1w ago

FeaturedOriginal

Arbor: Tree Search as a Cognition Layer for Autonomous Agents

AI Summary

Arbor introduces a framework utilizing structured tree search for optimizing LLM inference, achieving up to 193% throughput-latency improvement compared to vendor-optimized systems. It employs an Orchestrator and Critic agent for stability and coordination, demonstrating hardware-agnostic performance with minimal variance.

#LLM #Agent #Inference #AI Startup