Measuring and Mitigating Toxicity in Large Language Models: A Comprehensive Replication Study
Quick Take
This study evaluates DExperts, a decoding-time toxicity mitigation technique, on GPT-2. It reports a 100% safety rate against explicit toxicity but 98.5% against implicit hate speech, along with a roughly 10x latency increase. It highlights the need for more advanced methods that address diverse toxicity patterns without high computational costs.
Key Points
- DExperts achieves 100% safety on explicit toxicity benchmarks using RealToxicityPrompts.
- Safety rates drop to 98.5% against implicit hate speech from the ToxiGen dataset.
- The method incurs a latency penalty, increasing generation time from 0.2s to 2.0s.
- This study identifies a robustness gap in toxicity mitigation strategies.
- The authors emphasize the need for cost-effective methods that generalize across diverse hate speech patterns.
Article Content
From the source RSS / original summary: arXiv:2605.14087v1. Announce type: new.
Abstract: Large Language Models (LLMs) trained on web-scale corpora inherently absorb toxic patterns from their training data. This leads to "toxic degeneration", where even innocuous prompts can trigger harmful outputs. The phenomenon poses significant risks for real-world deployments and necessitates effective mitigation strategies that maintain model utility while ensuring safety.
In this comprehensive replication study, we evaluate the efficacy of DExperts (Decoding-time Experts), an inference-time mitigation technique that steers generation without requiring model retraining.
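The abstract does not spell out how the steering works. As a minimal sketch, assuming the logit-combination rule from the original DExperts paper (base logits plus a scaled difference between a non-toxic "expert" and a toxic "anti-expert"), the next-token distribution could be computed as follows. The function names, the toy three-token vocabulary, and the value of `alpha` are illustrative assumptions, not details taken from this study.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def dexperts_next_token_probs(base_logits, expert_logits,
                              anti_expert_logits, alpha=2.0):
    """Combine logits as z + alpha * (z_expert - z_anti_expert).

    alpha is a hypothetical steering strength; alpha = 0 recovers
    the unmodified base model distribution.
    """
    combined = [
        z + alpha * (z_pos - z_neg)
        for z, z_pos, z_neg in zip(base_logits, expert_logits,
                                   anti_expert_logits)
    ]
    return softmax(combined)


# Toy example: index 2 stands in for a toxic token (made-up numbers).
base = [1.0, 1.0, 2.0]
expert = [1.0, 1.0, 0.0]       # the expert dislikes the toxic token
anti_expert = [1.0, 1.0, 3.0]  # the anti-expert favors it

plain = softmax(base)
steered = dexperts_next_token_probs(base, expert, anti_expert, alpha=1.0)
```

In this toy setup the steered distribution assigns less probability to the token at index 2 than the base distribution does. Under this formulation each decoding step needs logits from three models rather than one.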
We structured our research into three systematic phases: (1) establishing baseline toxicity measurements using RealToxicityPrompts on standard GPT-2 models; (2) implementing and evaluating DExperts to mitigate explicit toxicity; and (3) stress-testing the method against implicit hate speech using the adversarial ToxiGen dataset.
Our empirical results confirm that while DExperts achieves near-perfect safety rates (100%) on explicit toxicity benchmarks, it exhibits brittleness against adversarial, implicit hate speech, with safety rates dropping to 98.5%. Furthermore, we quantify a critical trade-off: the method introduces a ~10x latency penalty (from 0.2s to 2.0s per generation), posing challenges for real-time deployment scenarios.
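The abstract does not define how the safety rate is computed. A minimal sketch, assuming it is the share of generations whose toxicity score falls below a cutoff, is shown here together with the latency ratio. The 0.5 threshold, the helper names, and the score list are hypothetical; only the 0.2s and 2.0s timings come from the abstract.

```python
def safety_rate(toxicity_scores, threshold=0.5):
    """Fraction of generations scored below the toxicity threshold.

    The threshold of 0.5 is an assumption, not a value from the study.
    """
    if not toxicity_scores:
        raise ValueError("need at least one score")
    safe = sum(1 for score in toxicity_scores if score < threshold)
    return safe / len(toxicity_scores)


def latency_ratio(baseline_seconds, mitigated_seconds):
    """How many times slower the mitigated generation is."""
    return mitigated_seconds / baseline_seconds


# Made-up scores: 197 safe and 3 unsafe generations out of 200.
scores = [0.1] * 197 + [0.9] * 3
rate = safety_rate(scores)

# Per-generation timings reported in the abstract.
slowdown = latency_ratio(0.2, 2.0)
```

With these illustrative scores the rate works out to 197/200, which matches the 98.5% figure, and the reported timings give the roughly 10x slowdown.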
This study contributes to the growing body of work on AI safety by highlighting the robustness gap between explicit and implicit toxicity mitigation. We emphasize the need for more sophisticated approaches that generalize across diverse hate speech patterns without prohibitive computational costs.


