ANNEAL: Adapting LLM Agents via Governed Symbolic Patch Learning

arXiv cs.AI·Safayat Bin Hakim, Keyan Guo, Wenkai Tan, Alvaro Velasquez, Shouhuai Xu, Houbing Herbert Song

5/19/2026

Quick Take

ANNEAL is a neuro-symbolic agent that converts the recurring execution failures of LLM-based agents into governed symbolic edits of a process knowledge graph, without modifying model weights. In the tested recurring-failure settings it reduces holdout failure rates to 0%, whereas strong baselines such as ReAct and Reflexion retain 72-100%. Its core mechanism, Failure-Driven Knowledge Acquisition (FDKA), localizes the responsible operator, synthesizes a typed patch, and validates the proposal before commit.

Key Points

ANNEAL reduces holdout failure rates on recurring faults to 0% in the tested settings, evaluated across four domains and 27 multi-seed runs.
Failure-Driven Knowledge Acquisition (FDKA) localizes the responsible operator and synthesizes a typed patch, which is validated before commit.
Strong baselines such as ReAct and Reflexion achieve high episodic recovery yet retain 72-100% holdout failure rates on recurring faults.
Every accepted edit carries full provenance and deterministic rollback capability.
In ablation, removing FDKA eliminates all structural repairs and drops success rate by up to 26.7 percentage points.
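The provenance and rollback point can be illustrated with a minimal sketch. Everything named here is a hypothetical stand-in and not taken from the paper: the `Patch` schema, the `KnowledgeGraph` class, the operator name `deploy`, and the failure id `trace-17`. It assumes the simplest possible patch type (adding one precondition to one operator) and shows only how recording a pre-edit snapshot alongside each patch makes an edit both attributable and exactly reversible.

```python
from __future__ import annotations

import copy
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Patch:
    """A typed edit (hypothetical schema): add one precondition to one operator."""
    operator: str
    add_precondition: str
    source_failure: str  # provenance: the recurring failure that motivated the edit


@dataclass
class KnowledgeGraph:
    # Operator name -> list of precondition strings (assumed representation).
    operators: dict[str, list[str]]
    # Each entry pairs a patch with the operator's preconditions before the edit.
    history: list[tuple[Patch, list[str]]] = field(default_factory=list)

    def commit(self, patch: Patch) -> None:
        """Apply the patch and record what the operator looked like beforehand."""
        before = copy.deepcopy(self.operators[patch.operator])
        self.operators[patch.operator].append(patch.add_precondition)
        self.history.append((patch, before))

    def rollback(self) -> Patch:
        """Undo the most recent patch by restoring its recorded snapshot."""
        patch, before = self.history.pop()
        self.operators[patch.operator] = before
        return patch


kg = KnowledgeGraph({"deploy": ["build_passed"]})
kg.commit(Patch("deploy", "config_validated", source_failure="trace-17"))
after_commit = list(kg.operators["deploy"])

undone = kg.rollback()
after_rollback = list(kg.operators["deploy"])
```

Restoring a stored snapshot, rather than computing an inverse edit, is one simple way to keep rollback deterministic; whether ANNEAL does it this way is not stated in the abstract.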

[Submitted on 4 May 2026]

Abstract: LLM-based agents can recover from individual execution errors, yet they repeatedly fail on the same fault when the underlying process knowledge--operator schemas, preconditions, and constraints--remains unrepaired. Existing self-evolving approaches address this gap by updating prompts, memory, or model weights, but none directly repair the symbolic structures that encode how tasks are executed, and few provide the governance guarantees required for safe deployment. We introduce ANNEAL, a neuro-symbolic agent that converts recurring failures into governed symbolic edits of a process knowledge graph without modifying foundation model weights. Its core mechanism, Failure-Driven Knowledge Acquisition (FDKA), localizes the responsible operator, synthesizes a typed patch through constrained LLM generation, and validates the proposal via multi-dimensional scoring, symbolic guardrails, and canary testing before commit. Every accepted edit carries full provenance and deterministic rollback capability. Across four domains and 27 multi-seed runs, ANNEAL is the only evaluated system that commits persistent structural repairs--strong baselines such as ReAct and Reflexion achieve high episodic recovery yet retain 72-100% holdout failure rates on recurring faults, whereas ANNEAL reduces these to 0% in the tested recurring-failure settings. Ablation confirms that removing FDKA eliminates all structural repairs and drops success rate by up to 26.7 percentage points. These results suggest that governed symbolic repair offers a complementary paradigm to weight-level and prompt-level adaptation for persistent fault elimination.
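The localize-then-validate flow described in the abstract can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: the trace format, the `localize` and `validate` helpers, the 0.8 score threshold, the allow-list guardrail, and the stub `canary` function are all hypothetical, and the constrained LLM generation step is replaced by a hand-written patch dictionary.

```python
from collections import Counter
from typing import Callable, Optional

# Hypothetical failure traces: (operator that failed, error message).
traces = [
    ("deploy", "missing config"),
    ("deploy", "missing config"),
    ("build", "timeout"),
]


def localize(traces, min_repeats: int = 2) -> Optional[str]:
    """Return the operator that fails most often, if it recurs often enough."""
    if not traces:
        return None
    counts = Counter(op for op, _ in traces)
    op, n = counts.most_common(1)[0]
    return op if n >= min_repeats else None


def validate(
    patch: dict,
    score: float,
    allowed_ops: set,
    canary: Callable[[dict], bool],
    threshold: float = 0.8,
) -> bool:
    """Every gate must pass before a patch may be committed."""
    passes_score = score >= threshold  # stand-in for multi-dimensional scoring
    passes_guardrail = (                # stand-in for symbolic guardrails
        patch["operator"] in allowed_ops
        and patch["kind"] == "add_precondition"
    )
    return passes_score and passes_guardrail and canary(patch)


def canary(patch: dict) -> bool:
    # Stand-in canary test: accept only the edit that addresses the known fault.
    return patch["value"] == "config_validated"


target = localize(traces)
# Hand-written patch in place of constrained LLM generation (assumption).
patch = {"operator": target, "kind": "add_precondition", "value": "config_validated"}

accepted = validate(patch, score=0.9, allowed_ops={"deploy", "build"}, canary=canary)
rejected = validate(patch, score=0.5, allowed_ops={"deploy", "build"}, canary=canary)
```

Requiring all gates to pass, rather than averaging them, means a high score cannot compensate for a guardrail violation or a failed canary run; the sketch chooses this conjunction for simplicity.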

Comments:	Code Implementation: this https URL
Subjects:	Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Cite as:	arXiv:2605.16309 [cs.AI]
	(or arXiv:2605.16309v1 [cs.AI] for this version)
	https://doi.org/10.48550/arXiv.2605.16309 arXiv-issued DOI via DataCite

Submission history

From: Safayat Bin Hakim
[v1] Mon, 4 May 2026 05:24:03 UTC (602 KB)

— Originally published at arxiv.org
