Benchmarking Open-Source Safety Guard Models: A Comprehensive Evaluation

arXiv cs.CL·Reetu Raj Harsh, Bhaskarjit Sarmah, Stefano Pasquali

5/29/2026

Quick Answer

In an evaluation of 14 open-source safety guard models, Qwen Guard (4B parameters) achieves the highest recall at 83.97%, while larger models such as Llama Guard (12B) and GPT-OSS Safeguard (20B) miss up to 75% of unsafe content.

Quick Take

A comprehensive evaluation of 14 open-source safety guard models reveals that Qwen Guard (4B parameters) achieves the highest recall at 83.97%, while larger models like Llama Guard (12B) and GPT-OSS Safeguard (20B) miss up to 75% of unsafe content. The findings indicate that model size does not correlate with safety performance, emphasizing recall as the critical metric for safety applications.

Key Points

Qwen Guard outperforms larger models in safety recall metrics.
Llama Guard and GPT-OSS Safeguard miss significant unsafe content.
Recall is prioritized over false positives in safety-critical applications.
General-purpose guard models are more effective than specialized ones.
The benchmark includes 79,331 samples across 8 safety categories.

Article Content

From source RSS / original summary

arXiv:2605.28830v1. Abstract: As Large Language Models (LLMs) are increasingly deployed in safety-critical applications, robust content moderation becomes essential. We present a comprehensive evaluation of 14 open-source safety guard models on a curated benchmark of 79,331 samples spanning 8 NIST AI Risk Framework safety categories.

Our benchmark aggregates four diverse datasets (HarmBench, StrongREJECT, RealToxicityPrompts, and BeaverTails), filtered to focus exclusively on safety-relevant content (violence, hate speech, harassment, sexual content, suicide/self-harm, profanity, threats, and health misinformation). We find that recall is the critical metric for safety applications, as missing harmful content poses greater risk than false positives.
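To make the recall argument concrete, here is a minimal sketch of how recall and false-positive rate could be computed for a guard model from binary labels. The function name, the data class, and the six toy samples are illustrative assumptions, not code or data from the paper.

```python
from dataclasses import dataclass


@dataclass
class GuardMetrics:
    recall: float               # share of unsafe samples the guard flagged
    false_positive_rate: float  # share of safe samples the guard flagged


def score_guard(labels: list[bool], flagged: list[bool]) -> GuardMetrics:
    """labels[i] is True when sample i is unsafe; flagged[i] is True
    when the (hypothetical) guard model flagged sample i."""
    if len(labels) != len(flagged):
        raise ValueError("labels and flagged must have the same length")
    pairs = list(zip(labels, flagged))
    tp = sum(1 for y, p in pairs if y and p)          # unsafe, caught
    fn = sum(1 for y, p in pairs if y and not p)      # unsafe, missed
    fp = sum(1 for y, p in pairs if not y and p)      # safe, wrongly flagged
    tn = sum(1 for y, p in pairs if not y and not p)  # safe, passed
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return GuardMetrics(recall=recall, false_positive_rate=fpr)


# Toy data, invented for illustration only: 4 unsafe and 2 safe samples.
labels = [True, True, True, True, False, False]
flagged = [True, False, False, False, False, True]
metrics = score_guard(labels, flagged)
print(metrics)
```

In this toy case the guard catches one of four unsafe samples, so three harmful items pass through unflagged. That missed-content cost is the one the abstract weighs more heavily than the single false positive.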

Our evaluation reveals surprising results: Qwen Guard (4B parameters) achieves the highest recall (83.97%) while larger models like Llama Guard (12B) and GPT-OSS Safeguard (20B) exhibit conservative behavior, missing up to 75% of unsafe content. We demonstrate that model size does not correlate with safety detection performance and that general-purpose guard models outperform specialized ones. These findings provide practical guidance for selecting safety guard models in production deployments.
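The abstract reports one model by recall and others by the share of unsafe content they miss. The two figures are complements of each other, so they can be put on the same scale. A small sketch, using only the percentages quoted above (the helper name is an assumption for illustration):

```python
def miss_rate(recall_percent: float) -> float:
    """Percentage of unsafe content missed, given recall in percent."""
    return round(100.0 - recall_percent, 2)


# Highest reported recall is 83.97%, so the miss rate is its complement.
qwen_miss = miss_rate(83.97)

# "Missing up to 75%" corresponds to a recall as low as 100 - 75 percent.
lowest_recall = round(100.0 - 75.0, 2)

print(qwen_miss, lowest_recall)
```

Read this way, the comparison is between a best-case recall of 83.97% and recall as low as 25% for the weakest models in the evaluation.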
