Faithful or Fabricated? A Causal Framework for Rationalization Bias in LLM Judges
Quick Take
This study asks whether LLM judges keep their rankings and explanations stable when non-evidential cues are perturbed and the underlying texts are held fixed. On a new dataset of 1,000 summaries from traditional extractive models and LLMs, it finds substantial cue-anchored rationalization under label and placebo perturbations, while the PROOF-BEFORE-PREFERENCE method markedly improves cue invariance over baselines.
Key Points
- LLM judges show cue-anchored rationalization under label and placebo perturbations.
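- Five cue interventions (Blind, Truth, Flip, Placebo, Reveal-After) were introduced to test cue invariance in LLM evaluations.
- PROOF-BEFORE-PREFERENCE (evidence lock, score, rank) markedly improved cue invariance over baselines, including structured chain-of-thought prompting.
- The study used a new dataset of 1,000 summaries from traditional extractive models and LLMs.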
- Findings highlight the need for improved evaluation frameworks in LLM applications.
Article Content
From source RSS / original summary: arXiv:2605.23970v1. Announce Type: new. Abstract: Large language models (LLMs) are increasingly used as automatic judges for summarization and dialogue evaluation. Prior work has documented biases such as position, verbosity, and style preferences, but largely focuses on outcomes, leaving judge explanations underexplored. We instead ask whether LLM judges are cue-invariant, i.e., whether their rankings and explanations remain stable when non-evidential cues are perturbed while holding the underlying texts fixed.
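The notion of cue invariance above can be illustrated with a minimal sketch. It assumes each judgment is recorded as one of "A", "B", or "tie" for a pair of summaries, and it measures how often the verdict is unchanged between a baseline condition and a perturbed one. The function name `invariance_rate` and the verdict encoding are hypothetical, chosen for illustration; the paper's own tie-aware metrics are not specified in the abstract and may differ.

```python
from typing import Sequence


def invariance_rate(baseline: Sequence[str], perturbed: Sequence[str]) -> float:
    """Fraction of items whose verdict ("A", "B", or "tie") is unchanged.

    Hypothetical illustration only: not the metric defined in the paper.
    """
    if len(baseline) != len(perturbed):
        raise ValueError("both conditions must cover the same items")
    if not baseline:
        raise ValueError("need at least one judged item")
    # A tie counts as stable only if it is still a tie after the perturbation.
    unchanged = sum(b == p for b, p in zip(baseline, perturbed))
    return unchanged / len(baseline)


# Toy verdicts for four summary pairs under two cue conditions.
blind_verdicts = ["A", "B", "tie", "A"]
flipped_verdicts = ["A", "A", "tie", "B"]
rate = invariance_rate(blind_verdicts, flipped_verdicts)
```

A fully cue-invariant judge would give a rate of 1.0 here, since the texts are identical across the two conditions and only the cue changes.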
We introduce a suite of cue interventions (Blind, Truth, Flip, Placebo, Reveal-After) and tie-aware metrics that quantify outcome anchoring and rationale anchoring, including label-aligned rhetoric and explanation drift, alongside consistency and stereotype-intrusion checks. We design anchoring attacks using verbosity and confidence cues, and compare two mitigations: structured chain-of-thought prompting and PROOF-BEFORE-PREFERENCE (evidence lock, score, rank).
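One way to picture the five interventions is as different views of the same pair of summaries, where only the source labels change. The sketch below is an assumption about what each condition could mean, inferred from the names alone: the abstract does not define them, and `CueView`, `apply_cue`, and the placeholder labels "System X" and "System Y" are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

Labels = Tuple[str, str]


@dataclass(frozen=True)
class CueView:
    shown_labels: Optional[Labels]    # labels visible while the judge decides
    revealed_later: Optional[Labels]  # labels disclosed only after the verdict


def apply_cue(condition: str, true_labels: Labels) -> CueView:
    """Map a condition name to the labels a judge would see (assumed semantics)."""
    first, second = true_labels
    if condition == "blind":         # no source labels at all
        return CueView(None, None)
    if condition == "truth":         # correct source labels
        return CueView((first, second), None)
    if condition == "flip":          # labels swapped between the two texts
        return CueView((second, first), None)
    if condition == "placebo":       # uninformative labels
        return CueView(("System X", "System Y"), None)
    if condition == "reveal_after":  # judge blind first, then disclose labels
        return CueView(None, (first, second))
    raise ValueError(f"unknown condition: {condition}")


views = {
    name: apply_cue(name, ("extractive model", "LLM"))
    for name in ("blind", "truth", "flip", "placebo", "reveal_after")
}
```

Because the summary texts never enter `apply_cue`, any change in a judge's verdict across these views can only come from the label cue, which is the property the interventions are meant to isolate.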
Using a new dataset of 1,000 summaries from traditional extractive models and LLMs, we find substantial cue-anchored rationalization under label and placebo perturbations, while PROOF-BEFORE-PREFERENCE markedly improves cue invariance over baselines.
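The three stages named for PROOF-BEFORE-PREFERENCE (evidence lock, score, rank) might be sketched as below. This is a guess at the control flow based only on the stage names, not the authors' implementation: the function `proof_before_preference`, the toy evidence extractor, and the toy scorer are all hypothetical stand-ins for what would be LLM calls in practice.

```python
from typing import Callable, Dict, List, Tuple


def proof_before_preference(
    source: str,
    summaries: Dict[str, str],
    extract_evidence: Callable[[str, str], List[str]],
    score_from_evidence: Callable[[List[str]], float],
) -> List[List[str]]:
    """Return ranked tiers of summary names, best first; equal scores share a tier."""
    # 1. Evidence lock: collect evidence per summary and freeze it as a tuple.
    locked: Dict[str, Tuple[str, ...]] = {
        name: tuple(extract_evidence(source, text)) for name, text in summaries.items()
    }
    # 2. Score: each score depends only on the locked evidence.
    scores = {name: score_from_evidence(list(ev)) for name, ev in locked.items()}
    # 3. Rank: derive the ordering from the scores, keeping ties explicit.
    tiers: Dict[float, List[str]] = {}
    for name, score in scores.items():
        tiers.setdefault(score, []).append(name)
    return [sorted(tiers[s]) for s in sorted(tiers, reverse=True)]


# Toy stand-ins (assumptions): evidence = summary sentences found in the source.
def toy_extract(source: str, summary: str) -> List[str]:
    return [s for s in summary.split(". ") if s.strip(".") in source]


def toy_score(evidence: List[str]) -> float:
    return float(len(evidence))


source_text = "The cat sat. The dog ran."
candidates = {"A": "The cat sat. The dog ran.", "B": "The cat sat. The bird flew."}
ranking = proof_before_preference(source_text, candidates, toy_extract, toy_score)
```

In this sketch the ranking is computed from scores that were fixed before any preference is expressed, and no cue such as a source label is passed to any stage, which is one plausible reading of why such a design could improve cue invariance.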