Right or Wrong, Models Comply: Directional Blindness in LLM Moral Judgment

arXiv cs.CL·Jihye Kim, Jeffrey Flanigan


6/15/2026

Quick Answer

This study introduces Compliance Asymmetry (A = BCR/HCR), a diagnostic that compares how often LLMs follow helpful nudges with how often they follow harmful ones. On moral judgments, models show directional blindness, following both kinds of nudge at nearly identical rates (A = 1.04); on factual questions, they favor helpful nudges (A = 1.58).

Quick Take

Key Points

Compliance Asymmetry (A = BCR/HCR) compares beneficial output change under helpful nudges with harmful change under misleading nudges.
On moral questions, models follow helpful and harmful nudges at nearly identical rates (A = 1.04); on factual questions, they favor helpful nudges (A = 1.58).
Chain-of-thought prompting amplifies helpful and harmful compliance together.
Identity-based prompting suppresses both kinds of compliance by nearly identical margins.
Direction-blind moral compliance is identified as a distinct failure mode in current LLMs.


Article Content

From source RSS / original summary

arXiv:2606.14037v1

Abstract: As language models take integrated roles across many domains, the response of LLMs to user pushback becomes a critical alignment property. Yet many existing evaluations treat compliance as unidirectional, measuring whether models resist pressure but not whether they resist it selectively. We introduce Compliance Asymmetry (A = BCR/HCR), a bidirectional diagnostic that compares beneficial output change under helpful nudges with harmful change under misleading nudges.
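The ratio can be illustrated with a minimal sketch. It assumes BCR is the fraction of helpful-nudge trials in which the output changed for the better, and HCR the fraction of misleading-nudge trials in which it changed for the worse. The abstract does not spell out the acronyms or the exact counting rule, so the `NudgeCounts` type, the function name, and the numbers below are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NudgeCounts:
    """Hypothetical tally for one nudge direction (not taken from the paper)."""
    changed: int  # trials where the output moved in the nudged direction
    total: int    # all trials run under this nudge direction

    def rate(self) -> float:
        if self.total <= 0:
            raise ValueError("total must be positive")
        return self.changed / self.total


def compliance_asymmetry(helpful: NudgeCounts, harmful: NudgeCounts) -> float:
    """A = BCR / HCR under the assumed definitions described above."""
    bcr = helpful.rate()  # assumed: beneficial change rate under helpful nudges
    hcr = harmful.rate()  # assumed: harmful change rate under misleading nudges
    if hcr == 0:
        raise ZeroDivisionError("HCR is zero, so A is undefined")
    return bcr / hcr


# Illustrative counts only; these are not results from the paper.
a = compliance_asymmetry(NudgeCounts(300, 1000), NudgeCounts(200, 1000))
print(f"A = {a:.2f}")  # roughly 1.5: helpful nudges are followed more often
```

Under this reading, A near 1 means the model follows helpful and misleading nudges about equally often, and A above 1 means it follows helpful nudges more often.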

Across 9 models and 972,000 nudge-condition responses, we find that this selectivity differs in factual and moral judgments: models follow helpful nudges more than harmful ones on factual questions (A = 1.58), but follow both directions at nearly identical rates on moral questions (A = 1.04). This phenomenon persists across model families, capability levels, and nudging types.
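One way to read the two reported values, assuming A is a plain ratio of two rates, is as a relative excess: A - 1 is how much higher helpful-nudge compliance is than harmful-nudge compliance. The helper name below is hypothetical; only the two A values come from the abstract.

```python
def relative_excess_percent(a: float) -> float:
    """Percent by which helpful compliance exceeds harmful compliance, given A = BCR/HCR."""
    return (a - 1.0) * 100.0


# The A values are those reported in the abstract; the reading is an assumption.
reported = {"factual": 1.58, "moral": 1.04}
for domain, a in reported.items():
    excess = relative_excess_percent(a)
    print(f"{domain}: A = {a:.2f}, helpful compliance about {excess:.0f}% higher")
```

On this reading, the factual gap is roughly 58% while the moral gap is roughly 4%, which is the sense in which moral compliance is close to direction-blind.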

Interestingly, we also find that chain-of-thought prompting amplifies helpful and harmful compliance together, while identity-based prompting suppresses both by nearly identical margins. These results identify direction-blind moral compliance as a distinct failure mode in current LLMs and suggest that alignment should target directionally calibrated updating rather than lower compliance alone.

