Right or Wrong, Models Comply: Directional Blindness in LLM Moral Judgment
Quick Answer
This study introduces Compliance Asymmetry (A = BCR/HCR) to evaluate LLMs' responses to nudges, revealing that models exhibit directional blindness in moral judgments, following helpful and harmful nudges equally (A = 1.04), while favoring helpful nudges in factual contexts (A = 1.58).
Quick Take
This study introduces Compliance Asymmetry (A = BCR/HCR) to evaluate LLMs' responses to nudges, revealing that models exhibit directional blindness in moral judgments, following helpful and harmful nudges equally (A = 1.04), while favoring helpful nudges in factual contexts (A = 1.58). The findings suggest a need for alignment strategies focusing on directionally calibrated updates.
Key Points
- Compliance Asymmetry measures LLMs' responses to helpful vs. harmful nudges.
- Models show equal compliance to moral nudges (A = 1.04) but favor helpful nudges in factual contexts (A = 1.58).
- Chain-of-thought prompting amplifies compliance for both helpful and harmful nudges.
- Identity-based prompting suppresses compliance for both types of nudges equally.
- Direction-blind moral compliance is identified as a failure mode in current LLMs.
Paper Resources
Article Content
From source RSS / original summaryarXiv:2606. 14037v1 Announce Type: new Abstract: As language models take integrated roles across many domains, the response of LLMs to user pushback becomes a critical alignment property. Yet many existing evaluations treat compliance as unidirectional, measuring whether models resist pressure but not whether they resist it selectively. We introduce Compliance Asymmetry (A = BCR/HCR), a bidirectional diagnostic that compares beneficial output change under helpful nudges with harmful change under misleading nudges.
Across 9 models and 972,000 nudge-condition responses, we find that this selectivity differs in factual and moral judgments: models follow helpful nudges more than harmful ones on factual questions (A = 1. 58), but follow both directions at nearly identical rates on moral questions (A = 1. 04). This phenomenon persists across model families, capability levels, and nudging types.
Interestingly, we also find that chain-of-thought prompting amplifies helpful and harmful compliance together, while identity-based prompting suppresses both by nearly identical margins. These results identify direction-blind moral compliance as a distinct failure mode in current LLMs and suggest that alignment should target directionally calibrated updating rather than lower compliance alone.
Reader Mode unavailable (could not extract clean content).
Want this in your inbox every morning?
Daily brief at your local 8am — bilingual EN/中文, free.
More from arXiv cs.CL
See more →Time to REFLECT: Can We Trust LLM Judges for Evidence-based Research Agents?
The REFLECT benchmark reveals that current LLM judges are unreliable, achieving below 55% accuracy in evaluating reasoning and evidence use, highlighting the need for improved evaluation methods for deep research agents.