Ensemble Monitoring for AI Control: Diverse Signals Outweigh More Compute

arXiv cs.AI·Eugene Koran, Yejun Yun, Samantha Tetef, Benjamin Arnav, Pablo Bernabeu-P\'erez

4d ago

·~2 min·5/18/2026·en·0

Quick Take

Diverse ensemble monitoring enhances AI safety by improving detection of misaligned actions without requiring more compute.

Key Points

Combining diverse monitors improves detection performance.
Fine-tuning enhances detection capabilities over prompting alone.
Best ensembles balance strong performance with low monitor correlation.

📖 Reader Mode

~2 min read

[Submitted on 14 May 2026]

View PDF HTML (experimental)

Abstract:As AI systems are increasingly deployed in autonomous agentic settings at scale, it is important to ensure the actions they take are safe and aligned with user intent. Monitoring agent actions is a key safety mechanism, yet reliable monitors remain difficult to build and the scale of these systems makes human oversight impractical. We show that combining signals from diverse monitors into an ensemble improves detection of misaligned actions. We build 12 GPT-4.1-Mini monitors using both prompting and fine-tuning strategies. We evaluate them on coding tasks where candidate solutions pass standard tests but fail on adversarial inputs. In this setting, diverse ensembles outperform both individual monitors and homogeneous ensembles. Our best 3-monitor ensemble achieves 2.4x greater detection performance gain compared to an ensemble composed of three identical monitors, with the same ensemble performing strongly on an independent dataset. We contend that these results show that diversity - not scale - drives gains. The best ensembles combine strong individual performance with low correlation between monitors. Furthermore, fine-tuned monitors appear in every top-performing ensemble and maintain this advantage on out-of-distribution attack types, suggesting that fine-tuning enables detection capabilities that prompting alone does not elicit. These results support ensemble monitoring as a practical AI control strategy for safety gains at reasonable inference costs.

Subjects:	Artificial Intelligence (cs.AI)
Cite as:	arXiv:2605.15377 [cs.AI]
	(or arXiv:2605.15377v1 [cs.AI] for this version)
	https://doi.org/10.48550/arXiv.2605.15377 arXiv-issued DOI via DataCite (pending registration)

Submission history

From: Yejun Yun [view email]
[v1] Thu, 14 May 2026 20:06:52 UTC (1,710 KB)

— Originally published at arxiv.org

Continue reading on arxiv.org

Want this in your inbox every morning?

Daily brief at your local 8am — bilingual EN/中文, free.

Subscribe — it's free

Ensemble Monitoring for AI Control: Diverse Signals Outweigh More Compute

Quick Take

Key Points

📖 Reader Mode

Submission history

Want this in your inbox every morning?

More from arXiv cs.AI

From Prompts to Protocols: An AI Agent for Laboratory Automation

Agentic Trading: When LLM Agents Meet Financial Markets

Insights Generator: Systematic Corpus-Level Trace Diagnostics for LLM Agents

Related in this space

Invisible Orchestrators Suppress Protective Behavior and Dissociate Power-Holders: Safety Risks in Multi-Agent LLM Systems

POLAR-Bench: A Diagnostic Benchmark for Privacy-Utility Trade-offs in LLM Agents

State Contamination in Memory-Augmented LLM Agents