Benchmarking Open-Source Safety Guard Models: A Comprehensive Evaluation
Quick Take
A comprehensive evaluation of 14 open-source safety guard models finds that Qwen Guard (4B parameters) achieves the highest recall at 83.97%, while larger models like Llama Guard (12B) and GPT-OSS Safeguard (20B) miss up to 75% of unsafe content. The authors conclude that model size does not correlate with safety detection performance, and they treat recall as the critical metric for safety applications.
Key Points
- Qwen Guard outperforms larger models in safety recall metrics.
- Llama Guard and GPT-OSS Safeguard miss up to 75% of unsafe content.
- Recall matters more than a low false-positive rate in safety-critical applications, because missing harmful content poses greater risk than a false alarm.
- General-purpose guard models are more effective than specialized ones.
- The benchmark includes 79,331 samples across 8 safety categories.
Article Content
From source RSS / original summary: arXiv:2605.28830v1. Announce Type: new. Abstract: As Large Language Models (LLMs) are increasingly deployed in safety-critical applications, robust content moderation becomes essential. We present a comprehensive evaluation of 14 open-source safety guard models on a curated benchmark of 79,331 samples spanning 8 NIST AI Risk Framework safety categories.
Our benchmark aggregates four diverse datasets (HarmBench, StrongREJECT, RealToxicityPrompts, and BeaverTails), filtered to focus exclusively on safety-relevant content (violence, hate speech, harassment, sexual content, suicide/self-harm, profanity, threats, and health misinformation). We find that recall is the critical metric for safety applications, as missing harmful content poses greater risk than false positives.
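The aggregation and filtering step described above can be pictured with a minimal sketch. The abstract does not describe the actual pipeline, so the record schema here (the "source", "category", and "text" fields) and the function name are hypothetical; only the eight category names and the dataset names come from the abstract.

```python
from collections import Counter

# The eight safety categories named in the abstract.
SAFETY_CATEGORIES = {
    "violence", "hate speech", "harassment", "sexual content",
    "suicide/self-harm", "profanity", "threats", "health misinformation",
}


def filter_safety_relevant(records):
    """Keep only records whose category is one of the eight safety categories."""
    return [r for r in records if r["category"] in SAFETY_CATEGORIES]


# Toy records for illustration only; the field names and the out-of-scope
# "privacy" label are assumptions, not the benchmark's real schema.
records = [
    {"source": "HarmBench", "category": "violence", "text": "..."},
    {"source": "BeaverTails", "category": "privacy", "text": "..."},
    {"source": "RealToxicityPrompts", "category": "profanity", "text": "..."},
]

kept = filter_safety_relevant(records)
per_category = Counter(r["category"] for r in kept)
```

In this toy run the out-of-scope record is dropped and the remaining two are counted per category, which is the kind of tally a per-category recall breakdown would start from.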
Our evaluation reveals surprising results: Qwen Guard (4B parameters) achieves the highest recall (83.97%) while larger models like Llama Guard (12B) and GPT-OSS Safeguard (20B) exhibit conservative behavior, missing up to 75% of unsafe content. We demonstrate that model size does not correlate with safety detection performance and that general-purpose guard models outperform specialized ones. These findings provide practical guidance for selecting safety guard models in production deployments.
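To make the reported figures concrete, the sketch below uses the standard definition of recall (true positives divided by all truly unsafe samples) and its complement, the share of unsafe content missed. The function names and the toy counts are illustrative assumptions; only the 83.97% and 75% figures come from the abstract.

```python
def recall(true_positives, false_negatives):
    """Fraction of truly unsafe samples that the guard model flagged."""
    total_unsafe = true_positives + false_negatives
    if total_unsafe == 0:
        raise ValueError("recall is undefined when there are no unsafe samples")
    return true_positives / total_unsafe


def miss_rate(recall_value):
    """Fraction of unsafe samples that slipped through (1 - recall)."""
    return 1.0 - recall_value


# Reported best recall of 83.97% implies about 16.03% of unsafe content missed.
qwen_missed = miss_rate(0.8397)

# A model that misses 75% of unsafe content has a recall of only 25%.
low_recall = 1.0 - 0.75

# Toy counts (hypothetical): 8,397 of 10,000 unsafe samples flagged.
toy_recall = recall(true_positives=8397, false_negatives=1603)
```

Note that false positives do not appear in the recall formula at all, which is why a model can look cautious on precision while still letting most unsafe content through.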
