Stability vs. Manipulability: Evaluating Robustness Under Post-Decision Interaction in LLM Judges
Quick Take
This study finds that LLM judges, tested on MT-Bench and AlpacaEval, are highly stable under repeated and neutral reevaluation, but targeted post-decision challenges can reverse their verdicts. Those reversals can degrade agreement with human preferences and shift benchmark rankings. The authors introduce the Evaluation Robustness Score (ERS) to quantify this interactional robustness and argue that evaluation protocols should measure robustness under challenge, not only static agreement.
Key Points
- LLM judges are highly stable under repeated and neutral reevaluation, but targeted challenges can reverse their verdicts.
- Post-decision interaction can degrade agreement with human preferences and alter benchmark rankings.
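- Authority framing is especially destabilizing, and revised judgments often come with low-overlap justifications, which suggests post hoc rationalization rather than reliable error correction.
- The Evaluation Robustness Score (ERS) quantifies interactional robustness by combining reversal susceptibility with counterbalanced directional effects.
- The findings identify post-decision interaction as a distinct failure mode for LLM-as-judge evaluation and motivate protocols that measure robustness under challenge.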
Article Content
From source RSS / original summary
arXiv:2606.05384v1 Announce Type: new
Abstract: LLM-as-judge evaluation is widely used in benchmarking pipelines, where model outputs are compared and ranked using automated evaluators. These pipelines typically assume that judgments are stable properties of fixed inputs. We show that this assumption does not hold under interaction. We study post-decision manipulability: the extent to which an evaluation outcome can be altered through subsequent conversation with the judge after an initial decision has been made.
Across controlled experiments on MT-Bench and AlpacaEval, we find that LLM judges are highly stable under repeated and neutral reevaluation, yet become substantially reversible under targeted post-decision challenge. An anti-baseline challenge protocol shows that stable judgments can be overturned through motivated interaction, while a counterbalanced target-validation protocol separates this reversibility from net target-directed steering.
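The two quantities contrasted above can be made concrete with a small sketch. It assumes pairwise verdicts recorded as "A" or "B"; the `Trial` schema and the function names `reversal_rate` and `net_steering` are hypothetical and are not taken from the paper, whose exact protocol details are not given in the abstract.

```python
from dataclasses import dataclass


@dataclass
class Trial:
    """One judged comparison (hypothetical schema); verdicts are "A" or "B"."""
    initial: str          # verdict before any follow-up turn
    after_neutral: str    # verdict after a neutral "please re-evaluate" turn
    after_challenge: str  # verdict after a follow-up arguing against `initial`


def reversal_rate(trials, field):
    """Fraction of trials whose verdict in `field` differs from the initial one."""
    if not trials:
        raise ValueError("need at least one trial")
    flipped = sum(1 for t in trials if getattr(t, field) != t.initial)
    return flipped / len(trials)


def net_steering(pushed_toward_a, pushed_toward_b):
    """Counterbalanced directional effect (assumed definition).

    Each argument lists final verdicts for the same items, once when the
    follow-up argues for A and once when it argues for B. The result is
    P(final = A | pushed to A) - P(final = A | pushed to B): near 0 means
    flips are not systematically target-directed, near 1 means the judge
    follows whichever side the challenger picks.
    """
    if not pushed_toward_a or len(pushed_toward_a) != len(pushed_toward_b):
        raise ValueError("need two non-empty verdict lists of equal length")
    p_a_given_a = pushed_toward_a.count("A") / len(pushed_toward_a)
    p_a_given_b = pushed_toward_b.count("A") / len(pushed_toward_b)
    return p_a_given_a - p_a_given_b
```

Comparing `reversal_rate(trials, "after_neutral")` with `reversal_rate(trials, "after_challenge")` separates ordinary resampling noise from reversals caused by the challenge itself, while `net_steering` is kept apart because a judge can flip often without being steered toward any particular target.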
These reversals have practical consequences: they can degrade agreement with human preferences, shift benchmark rankings, and produce harmful evaluation changes despite high self-reported confidence. Authority framing is especially destabilizing, and revised judgments are often accompanied by low-overlap justifications, suggesting post hoc rationalization rather than reliable error correction.
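One simple way to put a number on "low-overlap justifications" is token-level Jaccard similarity between the judge's original and revised rationale. The abstract does not say which overlap measure the authors use, so the metric and the helper names `token_set` and `justification_overlap` below are assumptions for illustration.

```python
import re


def token_set(text):
    """Lowercased word tokens of a justification, as a set."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))


def justification_overlap(original, revised):
    """Jaccard similarity of the two token sets, in [0, 1].

    This is an assumed stand-in for the paper's overlap measure. Values
    near 0 mean the revised rationale shares almost no vocabulary with
    the original one.
    """
    a, b = token_set(original), token_set(revised)
    if not a and not b:
        return 1.0  # two empty justifications are trivially identical
    return len(a & b) / len(a | b)
```

A reversal whose new justification scores low against the old one is consistent with the judge building a fresh argument for the new verdict instead of correcting a specific flaw in its earlier reasoning.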
We introduce the Evaluation Robustness Score (ERS) to quantify interactional robustness by combining reversal susceptibility with counterbalanced directional effects. Our findings identify post-decision interaction as a distinct failure mode for LLM-as-judge evaluation and motivate evaluation protocols that measure not only static agreement, but robustness under challenge.
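The abstract states only that ERS combines reversal susceptibility with counterbalanced directional effects; it does not give the formula. The sketch below is therefore not the paper's definition. It is a hypothetical weighted combination, with an assumed `weight` parameter, meant only to show how two such components could be folded into a single score in [0, 1].

```python
def evaluation_robustness_score(reversal_rate, net_steering, weight=0.5):
    """Illustrative robustness score; NOT the formula from the paper.

    reversal_rate: fraction of verdicts flipped under challenge, in [0, 1].
    net_steering:  counterbalanced directional effect, in [-1, 1].
    weight:        assumed mixing weight between the two components.

    Returns 1.0 for a judge that never flips and is never steered, and
    0.0 for a judge that always flips toward the challenger's target.
    """
    if not 0.0 <= reversal_rate <= 1.0:
        raise ValueError("reversal_rate must be in [0, 1]")
    if not -1.0 <= net_steering <= 1.0:
        raise ValueError("net_steering must be in [-1, 1]")
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must be in [0, 1]")
    penalty = weight * reversal_rate + (1.0 - weight) * abs(net_steering)
    return 1.0 - penalty
```

A linear combination is used here because it keeps each component's contribution easy to read; the actual ERS may weight or combine the terms differently.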