Regimes: An Auditable, Held-Out-Gated Improvement Loop Demonstrated on LongMemEval with ActiveGraph
Quick Answer
Regimes is an auditable improvement loop built on the ActiveGraph runtime. It diagnoses failed evaluations and repairs them, raising held-out LongMemEval-S accuracy by +0.05 to +0.10 in four of five splits and by +0.01 in the fifth.
Quick Take
Regimes runs on ActiveGraph, an event-sourced agent runtime, so every step of the improvement process is recorded and can be replayed. The same loop is designed to run against different tasks through a common interface.
Key Points
- Regimes operates on ActiveGraph, enabling controlled improvement loops.
- Held-out accuracy on LongMemEval-S improved by +0.05 to +0.10 in four of five seeded splits and by +0.01 in one over-promotion split.
- Failures are recorded and replayed, ensuring transparency in the process.
- The loop is target-agnostic, functioning across different tasks.
- Introduces a failure-regime taxonomy that routes each failure to a pipeline location; its marginal value over an unrouted baseline remains an open question.
Article Content
From source RSS / original summary: arXiv:2606.10241v1. Announce Type: new.
Abstract: Autonomous improvement loops are hard to trust because the improvement process is usually external scaffolding bolted onto the agent: failures go unlogged, diagnoses cannot be replayed, and promote-or-discard decisions land in a side database rather than the agent's own history. We show that an event-sourced agent runtime removes that friction and turns controlled improvement into a first-class workflow.
When the agent's state is a deterministic projection of an append-only event log, failures are recorded, a run replays exactly from its log, candidate patches scope to typed pipeline seams, gates are auditable, and every promotion or discard is itself an event. We demonstrate this with Regimes, a loop on the ActiveGraph runtime that diagnoses failed evaluations, proposes a repair at a pipeline point, and promotes it only after static checks, sandbox execution, in-sample evaluation, and held-out validation.
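The idea in the paragraph above can be illustrated with a minimal Python sketch. It is not the ActiveGraph API: the `Event` type, the event kinds, the `project` and `run_gates` functions, and the `reader_prompt` seam name are all hypothetical, chosen only to show state as a deterministic fold over an append-only log, with each gate result and each promote-or-discard decision written back as an event.

```python
from dataclasses import dataclass

# Hypothetical event record; the real runtime's schema is not given in the source.
@dataclass(frozen=True)
class Event:
    kind: str
    payload: dict

# Gate order taken from the abstract; the identifiers themselves are illustrative.
GATES = ("static_checks", "sandbox_execution", "in_sample_eval", "held_out_validation")

def project(log):
    """Rebuild state from scratch by folding over the log (no hidden state)."""
    state = {"reader_prompt": "v0", "failures": [], "discarded": []}
    for ev in log:
        if ev.kind == "evaluation_failed":
            state["failures"].append(ev.payload["question_id"])
        elif ev.kind == "patch_promoted":
            # A patch is scoped to one named seam of the pipeline.
            state[ev.payload["seam"]] = ev.payload["value"]
        elif ev.kind == "patch_discarded":
            state["discarded"].append(ev.payload["value"])
    return state

def run_gates(log, patch, checks):
    """Run the gates in order; record every result and the final decision."""
    for gate in GATES:
        passed = bool(checks[gate](patch))
        log.append(Event("gate_result", {"gate": gate, "passed": passed}))
        if not passed:
            log.append(Event("patch_discarded", {**patch, "failed_gate": gate}))
            return False
    log.append(Event("patch_promoted", dict(patch)))
    return True

if __name__ == "__main__":
    log = [Event("evaluation_failed", {"question_id": "q17"})]
    always_pass = {gate: (lambda patch: True) for gate in GATES}
    run_gates(log, {"seam": "reader_prompt", "value": "v1"}, always_pass)
    # Replaying the same log always yields the same state.
    print(project(log) == project(list(log)))
```

Because the decision is stored in the same log as the failures that motivated it, an auditor needs only the log to reconstruct both the state and the reason it changed.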
The loop is target-agnostic: the same control flow runs against different tasks through a common interface. On LongMemEval-S the dominant failure is not retrieval but reconciliation: the evidence is already in the assembled context, yet the reader answers incorrectly. Across five seeded held-out splits, Regimes discovers reader-prompt repairs that improve final held-out accuracy by +0.05 to +0.10 in four splits and +0.01 in one over-promotion split; two splits are individually significant (seed 5 unadjusted for its sequential promotion structure), and the pooled count is descriptive only, since the splits share one 500-question pool.
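To make the evaluation setup concrete, here is a small Python sketch of seeded splits drawn from one shared pool and of a held-out accuracy delta. The 50/50 split ratio, the function names, and the toy correctness sets are assumptions for illustration; the source states only that there are five seeded splits over a single 500-question pool.

```python
import random

def seeded_split(pool, seed, held_out_fraction=0.5):
    """Split one shared pool reproducibly; the 0.5 fraction is an assumption."""
    rng = random.Random(seed)          # local RNG, so the global seed is untouched
    shuffled = list(pool)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - held_out_fraction))
    return shuffled[:cut], shuffled[cut:]   # (in-sample, held-out)

def accuracy(correct_ids, question_ids):
    """Fraction of question_ids that appear in the set of correct answers."""
    return sum(q in correct_ids for q in question_ids) / len(question_ids)

def held_out_delta(baseline_correct, patched_correct, held_out):
    """Accuracy change on held-out questions only."""
    return accuracy(patched_correct, held_out) - accuracy(baseline_correct, held_out)

if __name__ == "__main__":
    pool = range(500)
    for seed in range(1, 6):
        in_sample, held_out = seeded_split(pool, seed)
        print(seed, len(in_sample), len(held_out))
```

Every seed reshuffles the same 500 questions, so the held-out sets of different seeds overlap. That overlap is why a pooled count across splits can be reported only descriptively rather than as an independent significance test.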
The durable contributions are ActiveGraph as an auditable substrate that makes controlled improvement loops tractable, the held-out-gated loop it supports, the failure-regime taxonomy routing each failure to a pipeline location (whose marginal value over an unrouted baseline is the primary open question), and the prompt-as-discovery-probe hypothesis.
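The routing idea can be sketched in Python as follows. Only the two regimes named in the abstract (retrieval and reconciliation) are modeled, the substring test for "evidence is in the context" is a deliberate simplification, and the `FailedEval` fields and pipeline location names are hypothetical rather than taken from the paper.

```python
from dataclasses import dataclass
from enum import Enum

class Regime(Enum):
    RETRIEVAL = "retrieval"            # evidence never reached the context
    RECONCILIATION = "reconciliation"  # evidence present, reader still wrong

# Hypothetical mapping from regime to the pipeline location that gets patched.
ROUTES = {
    Regime.RETRIEVAL: "retriever",
    Regime.RECONCILIATION: "reader_prompt",
}

@dataclass(frozen=True)
class FailedEval:
    question_id: str
    context: str    # text assembled for the reader
    evidence: str   # gold evidence needed to answer

def classify(failure):
    """Assign a regime; substring matching stands in for a real evidence check."""
    if failure.evidence in failure.context:
        return Regime.RECONCILIATION
    return Regime.RETRIEVAL

def route(failure):
    """Return the pipeline location where a repair should be proposed."""
    return ROUTES[classify(failure)]

if __name__ == "__main__":
    f = FailedEval("q1", "Alice moved to Paris in 2021.", "moved to Paris")
    print(classify(f).value, route(f))
```

An unrouted baseline would propose repairs without the `classify` step. Comparing the two is the open question the abstract raises.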