Trace2Policy: From Expert Behavior Traces to Self-Evolving Decision Agents
Quick Take
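Trace2Policy introduces EISR (Error-driven Iterative Skill Refinement), which refines decision rules for compliance tasks through iterative error analysis. After eight EISR rounds, the rules compiled into deterministic Python reach 79.6% accuracy, 9.8 percentage points higher than the same rules run as an LLM prompt and above the 72.7% pure-LLM baseline they replaced. Auto-EISR, an LLM-driven variant, reproduces the refinement at $5–$10 per cycle versus roughly 70 expert-hours.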
Key Points
- EISR improves compliance decision rules through iterative error analysis.
- Rules compiled into deterministic Python reached 79.6% accuracy, versus 72.7% for the pure-LLM baseline they replaced.
- Auto-EISR costs $5–$10 per refinement cycle versus roughly 70 expert-hours.
- Deployed for 22 days at a major logistics carrier, covering 3,349 audit cases.
- On compliance-sensitive, skewed-base-rate decision tasks, rule quality rather than model capability is the dominant performance lever.
Article Content
From source RSS / original summary
arXiv:2606.10457v1 Announce Type: new
Abstract: Decision rules that enterprise experts apply tacitly -- in auditing, compliance, and contract review -- can be systematically recovered and improved through iterative error analysis.
We present Trace2Policy, whose core mechanism -- EISR (Error-driven Iterative Skill Refinement) -- maintains a human-readable rule document as its optimization target: each round executes the rules on a validation set, clusters errors by root cause into MISSING, WRONG, or CONFLICT types, applies targeted patches, and commits only those that pass a regression gate.
For this class of compliance-sensitive, skewed-base-rate decision tasks, we identify rule quality -- not model capability -- as the dominant performance lever: across five LLMs, one-shot distillation plateaus near ~70% on the deployed pool, while eight EISR rounds lift the same rules to 79.6% when compiled into deterministic Python -- zero LLM calls at inference. Execution form compounds the gain: in production, the same EISR-refined content runs 9.8 pp higher as compiled Python than as an LLM prompt, a form-and-engineering bundle the 22-day deployment matured together.
Deployed for 22 days at a major logistics carrier (3,349 audit cases), the compiled pipeline outperforms the pure-LLM baseline it replaced (72.7%); on these calibrated, skewed-base-rate workloads, re-enabling LLM fallback monotonically degrades accuracy.
An LLM-driven variant, Auto-EISR, reproduces this refinement at $5–$10 per cycle versus ~70 expert-hours, and transfers to four public benchmarks spanning legal reasoning (LegalBench) and process-mining decisions (BPIC 2012) without re-engineering.
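The EISR round described in the abstract (execute the rules, sort errors into MISSING, WRONG, or CONFLICT, apply patches, commit only what passes a regression gate) can be sketched in a few lines of Python. This is an illustrative toy, not the authors' implementation: the rule representation, the `DEFAULT` decision, the error taxonomy in `classify_error`, the gate criterion in `passes_gate`, and all case fields (`amount`, `late_days`) are assumptions made for this example.

```python
from typing import Callable, Dict, List, Tuple

Case = Dict[str, float]
Rule = Tuple[str, Callable[[Case], bool], str]  # (name, condition, decision)
Labeled = List[Tuple[Case, str]]
Patch = Callable[[List[Rule]], List[Rule]]

DEFAULT = "approve"  # hypothetical decision when no rule fires


def decide(rules: List[Rule], case: Case) -> str:
    """Deterministic execution: the first matching rule wins, no LLM call."""
    for _name, condition, decision in rules:
        if condition(case):
            return decision
    return DEFAULT


def accuracy(rules: List[Rule], labeled: Labeled) -> float:
    return sum(decide(rules, c) == y for c, y in labeled) / len(labeled)


def classify_error(rules: List[Rule], case: Case) -> str:
    """Assumed taxonomy for a misclassified case (not the paper's definition)."""
    fired = {decision for _name, condition, decision in rules if condition(case)}
    if not fired:
        return "MISSING"   # no rule covered the case
    if len(fired) > 1:
        return "CONFLICT"  # the rules that fired disagree with each other
    return "WRONG"         # the rules that fired agree, but on the wrong decision


def passes_gate(old: List[Rule], new: List[Rule],
                validation: Labeled, regression: Labeled) -> bool:
    """Assumed gate: validation accuracy must improve, and no regression
    case the old rules got right may become wrong."""
    if accuracy(new, validation) <= accuracy(old, validation):
        return False
    return all(decide(new, c) == y for c, y in regression if decide(old, c) == y)


def eisr_round(rules: List[Rule], patches: List[Patch],
               validation: Labeled, regression: Labeled) -> List[Rule]:
    """Try each candidate patch in turn; commit only those that pass the gate."""
    for patch in patches:
        candidate = patch(rules)
        if passes_gate(rules, candidate, validation, regression):
            rules = candidate
    return rules


# Toy data: one starting rule, four validation cases, one regression case.
rules_v0: List[Rule] = [("high_amount", lambda c: c["amount"] > 1000, "reject")]
validation: Labeled = [
    ({"amount": 1500, "late_days": 0}, "reject"),
    ({"amount": 200, "late_days": 0}, "approve"),
    ({"amount": 300, "late_days": 10}, "reject"),  # not covered by rules_v0
    ({"amount": 100, "late_days": 1}, "approve"),
]
regression: Labeled = [({"amount": 50, "late_days": 3}, "approve")]


def too_broad(rules: List[Rule]) -> List[Rule]:
    # Fixes the validation error but breaks the regression case.
    return rules + [("late_over_2", lambda c: c["late_days"] > 2, "reject")]


def targeted(rules: List[Rule]) -> List[Rule]:
    return rules + [("late_over_7", lambda c: c["late_days"] > 7, "reject")]


errors = [classify_error(rules_v0, c) for c, y in validation
          if decide(rules_v0, c) != y]
rules_v1 = eisr_round(rules_v0, [too_broad, targeted], validation, regression)
print(errors, [name for name, _, _ in rules_v1], accuracy(rules_v1, validation))
```

In this toy run the only validation error is a coverage gap, the over-broad patch is rejected because it flips a previously correct regression case, and the narrower patch is committed. The committed rules stay plain, inspectable Python, which mirrors the abstract's point that inference needs no LLM calls.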