Ensemble Monitoring for AI Control: Diverse Signals Outweigh More Compute
Quick Take
Ensembles of diverse monitors detect misaligned agent actions better than single monitors or same-size ensembles of identical monitors, so the gain comes from diversity rather than scale.
Key Points
- Combining diverse monitors improves detection performance.
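- Fine-tuned monitors appear in every top-performing ensemble, suggesting that fine-tuning elicits detection capabilities that prompting alone does not.
- The best ensembles combine strong individual monitors with low correlation between them.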
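The trade-off in the last point can be made concrete with a small sketch. It assumes each monitor emits a numeric suspicion score per sample and that ground-truth labels are available (1 for misaligned, 0 for benign). The function names, monitor names, and toy scores are hypothetical illustrations, not taken from the paper, and the paper's own selection procedure may differ.

```python
from itertools import combinations
from math import sqrt


def auroc(scores, labels):
    """Probability that a misaligned sample (label 1) outscores a benign one (label 0)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    # Count pairwise wins; ties count as half a win.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def pearson(xs, ys):
    """Pearson correlation; assumes neither list is constant."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)


def summarize(monitor_scores, labels, members):
    """Return (mean individual AUROC, mean pairwise correlation) for a candidate ensemble."""
    mean_auc = sum(auroc(monitor_scores[m], labels) for m in members) / len(members)
    pairs = list(combinations(members, 2))
    mean_corr = sum(
        pearson(monitor_scores[a], monitor_scores[b]) for a, b in pairs
    ) / len(pairs)
    return mean_auc, mean_corr


# Toy scores for three hypothetical monitors on six samples (made-up numbers).
labels = [1, 1, 1, 0, 0, 0]
monitor_scores = {
    "prompted_a": [0.9, 0.7, 0.3, 0.4, 0.2, 0.1],
    "prompted_b": [0.8, 0.6, 0.2, 0.5, 0.3, 0.1],
    "finetuned_c": [0.4, 0.8, 0.9, 0.1, 0.3, 0.2],
}
print(summarize(monitor_scores, labels, ("prompted_a", "prompted_b", "finetuned_c")))
```

A candidate ensemble looks attractive under this view when the first number is high and the second is low, meaning each member is individually useful and the members tend to err on different samples.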
Abstract: As AI systems are increasingly deployed in autonomous agentic settings at scale, it is important to ensure the actions they take are safe and aligned with user intent. Monitoring agent actions is a key safety mechanism, yet reliable monitors remain difficult to build and the scale of these systems makes human oversight impractical. We show that combining signals from diverse monitors into an ensemble improves detection of misaligned actions. We build 12 GPT-4.1-Mini monitors using both prompting and fine-tuning strategies. We evaluate them on coding tasks where candidate solutions pass standard tests but fail on adversarial inputs. In this setting, diverse ensembles outperform both individual monitors and homogeneous ensembles. Our best 3-monitor ensemble achieves 2.4x greater detection performance gain compared to an ensemble composed of three identical monitors, with the same ensemble performing strongly on an independent dataset. We contend that these results show that diversity, not scale, drives gains. The best ensembles combine strong individual performance with low correlation between monitors. Furthermore, fine-tuned monitors appear in every top-performing ensemble and maintain this advantage on out-of-distribution attack types, suggesting that fine-tuning enables detection capabilities that prompting alone does not elicit. These results support ensemble monitoring as a practical AI control strategy for safety gains at reasonable inference costs.
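The idea of combining monitor signals can be sketched as follows. The abstract does not state how the scores are aggregated, so this example assumes a simple mean of per-monitor suspicion scores followed by a fixed threshold. The function names, the threshold of 0.5, and the toy scores are all hypothetical.

```python
def ensemble_score(scores_per_monitor):
    """Average the suspicion scores several monitors assign to one action (assumed rule)."""
    return sum(scores_per_monitor) / len(scores_per_monitor)


def flag_actions(score_matrix, threshold=0.5):
    """score_matrix[i][j] is monitor j's score for action i; returns one flag per action."""
    return [ensemble_score(row) >= threshold for row in score_matrix]


# Toy scores from three hypothetical monitors for three actions (made-up numbers).
score_matrix = [
    [0.2, 0.9, 0.8],  # first monitor misses it, the other two catch it
    [0.1, 0.2, 0.3],  # all monitors consider it benign
    [0.7, 0.4, 0.6],  # mixed signal, average is above the threshold
]
print(flag_actions(score_matrix))  # [True, False, True]
```

In the first row a single monitor on its own would have let the action through, while the averaged score still crosses the threshold. Averaging copies of one monitor that err on the same actions would not provide this kind of correction.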
Subjects: Artificial Intelligence (cs.AI)
Cite as: arXiv:2605.15377 [cs.AI] (or arXiv:2605.15377v1 [cs.AI] for this version)
DOI: https://doi.org/10.48550/arXiv.2605.15377 (arXiv-issued DOI via DataCite, pending registration)
Submission history
From: Yejun Yun
[v1] Thu, 14 May 2026 20:06:52 UTC (1,710 KB)
— Originally published at arxiv.org