LCO: LLM-based Constraint Optimization for Safer Agentic LLMs in Real-world Tasks
Quick Take
The LLM-based Constraint Optimization (LCO) framework reduces in-context reward hacking (ICRH) in LLM agents without model fine-tuning: it cuts the Toxicity Growth Rate by 39% on GPT-4 in a tweet engagement optimization task and lowers the ICRH Occurrence Rate by 15.23% on a policy optimization benchmark.
Key Points
- LCO consists of a self-thought module and an evolutionary sampling module.
- It mitigates ICRH without requiring fine-tuning of the LLM.
- On the tweet engagement optimization task, LCO reduced the Toxicity Growth Rate (TGR) by 39% on GPT-4.
- On the policy optimization benchmark, LCO reduced the ICRH Occurrence Rate by 15.23%.
- LCO maintains task performance while enhancing safety.
Article Content
Original summary (arXiv:2605.27375v1). Abstract: Large Language Models (LLMs) are increasingly acting as autonomous agents, but their continuous interaction with the environment can lead to in-context reward hacking (ICRH), a phenomenon where LLMs iteratively optimize their behavior to maximize proxy objectives, inadvertently producing harmful side effects. Existing defense methods are insufficient to address this risk, as ICRH arises not from adversarial inputs but from the model's own over-optimization.
To mitigate this issue, we propose LLM-based Constraint Optimization (LCO), a framework that effectively reduces ICRH without model fine-tuning. LCO consists of two modules: a self-thought module, which guides the LLM to proactively deliberate and integrate potential safety constraints before execution; and an evolutionary sampling module, which employs LLM-based crossover and mutation to constrain the model's actions within a safe solution space while maintaining task performance.
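The two-module idea can be illustrated with a toy sketch. This is not the paper's implementation: every function name here (`self_thought`, `is_safe`, `proxy_score`, `crossover`, `mutate`, `evolutionary_sample`) is hypothetical, the LLM calls are replaced by deterministic stubs, and the banned-word check and word-weight score are assumptions chosen only to make the loop runnable. It shows constraints being fixed before any candidate is generated, then crossover and mutation producing candidates that are kept only if they satisfy those constraints.

```python
import random
from typing import List


def self_thought(task: str) -> List[str]:
    """Stub for the self-thought step (assumption: constraints are banned terms).
    A real system would prompt the LLM to deliberate about the task instead."""
    return ["hate", "idiot"]


def is_safe(text: str, constraints: List[str]) -> bool:
    """A candidate is safe if it contains none of the constrained terms."""
    lowered = text.lower()
    return not any(term in lowered for term in constraints)


def proxy_score(text: str) -> int:
    """Toy engagement proxy that rewards provocative words, toxic ones included."""
    weights = {"hate": 5, "idiot": 5, "wow": 2, "new": 1, "launch": 1}
    return sum(weights.get(w.strip("!.,").lower(), 0) for w in text.split())


def crossover(a: str, b: str) -> str:
    """Stub for LLM-based crossover: first half of a joined with second half of b."""
    wa, wb = a.split(), b.split()
    return " ".join(wa[: len(wa) // 2] + wb[len(wb) // 2:])


def mutate(text: str, rng: random.Random) -> str:
    """Stub for LLM-based mutation: replace one random word from a small vocabulary."""
    vocab = ["wow", "new", "launch", "hate", "idiot"]
    words = text.split()
    words[rng.randrange(len(words))] = rng.choice(vocab)
    return " ".join(words)


def evolutionary_sample(seeds: List[str], constraints: List[str],
                        generations: int = 5, keep: int = 4, seed: int = 0) -> str:
    """Evolve candidates, discarding any that violate the constraints."""
    rng = random.Random(seed)
    population = [s for s in seeds if is_safe(s, constraints)]
    if not population:
        raise ValueError("no safe seed candidates")
    for _ in range(generations):
        children = []
        for _ in range(2 * keep):
            a, b = rng.choice(population), rng.choice(population)
            children.append(mutate(crossover(a, b), rng))
        # Unsafe children are filtered out before selection, so the proxy
        # can only be optimized inside the safe candidate set.
        pool = population + [c for c in children if is_safe(c, constraints)]
        population = sorted(set(pool), key=lambda t: (-proxy_score(t), t))[:keep]
    return population[0]


if __name__ == "__main__":
    seeds = ["Our new product launch is here",
             "I hate this idiot take",
             "Check out the launch today"]
    constraints = self_thought("maximize tweet engagement")
    print(evolutionary_sample(seeds, constraints))
```

In this sketch the toxic words carry the highest proxy weight, so an unconstrained search would drift toward them. Filtering before selection keeps the best surviving candidate safe, and retaining the previous population ensures the best safe score never decreases across generations.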
Experimental results demonstrate that LCO substantially alleviates ICRH in both output-refine and policy-refine scenarios. In particular, on the tweet engagement optimization task, LCO achieves a 39% reduction in the Toxicity Growth Rate (TGR) on GPT-4, while on the policy optimization benchmark, it reduces the ICRH Occurrence Rate by 15.23%, demonstrating safety improvement without sacrificing task performance.