Wait, am I Being Fair? Characterizing Deductive Stereotyping and Mitigating It with Fair-GCG
Quick Answer
This paper identifies deductive stereotyping, a failure mode in which large language models (LLMs) apply population-level statistical regularities to individual cases and so produce logically coherent but socially biased inferences.
Quick Take
To counteract deductive stereotyping, the authors propose a reasoning-time injection framework together with Fair-GCG, a method that systematically discovers injection phrases that steer models toward fairness-aware reasoning. The discovered phrases improve performance on multiple fairness benchmarks and transfer to real-world fairness-sensitive tasks.
Key Points
- Deductive stereotyping produces biased inferences in LLMs even though reasoning generally improves fairness.
- Fair-GCG systematically discovers injection phrases to enhance fairness in reasoning.
- The proposed framework improves performance across multiple fairness benchmarks.
- Injection phrases discovered by Fair-GCG generalize from smaller to larger LLMs and reduce bias in open-ended generation.
- The approach transfers effectively to real-world fairness-sensitive applications.
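To make the idea of reasoning-time injection concrete, here is a minimal sketch of how a phrase could be placed inside a model's reasoning trace before generation continues. It is an illustration under stated assumptions, not the authors' implementation: the `<think>` delimiter, the function name `inject_reasoning_phrase`, and the example phrase (borrowed from the paper's title, not a phrase reported as discovered by Fair-GCG) are all hypothetical.

```python
# Hypothetical sketch of reasoning-time injection; not the paper's code.
# Assumption: the model marks the start of its reasoning with a "<think>" tag.
THINK_OPEN = "<think>"


def inject_reasoning_phrase(question: str, partial_reasoning: str, phrase: str) -> str:
    """Build a prefilled prompt whose reasoning trace ends with `phrase`.

    The model would continue generating from the end of the returned string,
    so the injected phrase becomes part of its own chain of thought.
    """
    parts = [question.strip(), THINK_OPEN]
    if partial_reasoning.strip():
        parts.append(partial_reasoning.strip())
    parts.append(phrase.strip())
    return "\n".join(parts) + "\n"


# Placeholder phrase taken from the paper's title, used only for illustration.
prompt = inject_reasoning_phrase(
    question="Two candidates applied for the job. Which one is less reliable?",
    partial_reasoning="",
    phrase="Wait, am I being fair?",
)
print(prompt)
```

In this sketch the phrase is fixed by hand. According to the abstract, Fair-GCG's role is to search for effective phrases systematically; that search procedure is not shown here.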
Article Excerpt
From source RSS / original summary (arXiv:2606.30989v1)
Abstract: Warning: This paper contains several toxic and offensive statements. While reasoning generally improves fairness in recent large language models (LLMs), failures persist. In this work, we identify a failure mode, deductive stereotyping, in which models apply population-level statistical regularities to individual cases, producing logically coherent yet socially biased inferences. We provide a statistical interpretation of this phenomenon.
To steer models toward fairness-aware reasoning, we propose a reasoning-time injection framework. We further introduce Fair-GCG to systematically discover effective injection phrases. Injection phrases discovered by Fair-GCG improve performance across multiple fairness benchmarks, generalize from smaller to larger LLMs, improve reasoning-level fairness, reduce bias in open-ended generation, and transfer to real-world fairness-sensitive tasks.
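The core of deductive stereotyping, applying a population-level regularity to an individual, can be illustrated with a small calculation. The sketch below is not the paper's statistical interpretation, which the abstract does not detail; the group names, rates, and function names are invented for illustration.

```python
# Illustrative numbers only (assumed, not taken from the paper).
# Each value is the share of a group's members that has some trait.
rates = {"group_a": 0.30, "group_b": 0.20}


def higher_rate_group(group_rates: dict) -> str:
    """Return the group with the higher population-level rate."""
    return max(group_rates, key=group_rates.get)


def error_when_applied_to_individuals(group_rates: dict, group: str) -> float:
    """Share of wrong calls if every member of `group` is assumed to have the trait."""
    return 1.0 - group_rates[group]


flagged = higher_rate_group(rates)
error = error_when_applied_to_individuals(rates, flagged)
print(f"Group with the higher rate: {flagged}")
print(f"Share of its members misjudged by the blanket inference: {error:.0%}")
```

With these assumed rates, the comparison between groups is statistically true at the population level, yet the inference about any single member of the flagged group is wrong for most members. That gap between a valid aggregate and an invalid individual conclusion is what the abstract describes as logically coherent yet socially biased.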