Stability vs. Manipulability: Evaluating Robustness Under Post-Decision Interaction in LLM Judges

arXiv cs.AI·Srimonti Dutta, Akshata Kishore Moharir

6/6/2026


Quick Answer

This study reveals that LLM judges, such as those evaluated with MT-Bench and AlpacaEval, exhibit high stability under neutral reevaluation but can be significantly influenced by targeted post-decision interactions, leading to altered outcomes and potential misalignment with human preferences.

Quick Take

The paper introduces the Evaluation Robustness Score (ERS) to quantify interactional robustness, highlighting the need for improved evaluation protocols.

Key Points

LLM judges show high stability during neutral reevaluation but can be reversed by targeted challenges.
Post-decision interaction can degrade agreement with human preferences and alter benchmark rankings.
Authority framing destabilizes judgments, leading to low-overlap justifications and potential rationalizations.
The Evaluation Robustness Score (ERS) quantifies interactional robustness in LLM evaluations.
Findings suggest a distinct failure mode for LLM-as-judge evaluations requiring new protocols.
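
The contrast in the points above (stable under neutral reevaluation, reversible under targeted challenge) can be illustrated with a minimal sketch. This is not the paper's ERS formula, which this summary does not give. The `JudgeTrial` record, the two rate functions, and the sample verdicts are all hypothetical and exist only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class JudgeTrial:
    """One pairwise comparison judged three times (hypothetical record)."""
    initial: str          # verdict before any follow-up, e.g. "A" or "B"
    after_neutral: str    # verdict after a neutral "please re-evaluate" turn
    after_challenge: str  # verdict after a targeted challenge turn


def stability_rate(trials: list[JudgeTrial]) -> float:
    """Fraction of trials where neutral reevaluation keeps the initial verdict."""
    if not trials:
        raise ValueError("need at least one trial")
    return sum(t.initial == t.after_neutral for t in trials) / len(trials)


def flip_rate(trials: list[JudgeTrial]) -> float:
    """Fraction of trials where a targeted challenge reverses the initial verdict."""
    if not trials:
        raise ValueError("need at least one trial")
    return sum(t.initial != t.after_challenge for t in trials) / len(trials)


# Made-up verdicts, not results from the paper.
trials = [
    JudgeTrial("A", "A", "B"),
    JudgeTrial("A", "A", "A"),
    JudgeTrial("B", "B", "A"),
    JudgeTrial("B", "B", "B"),
]

print(f"stability under neutral reevaluation: {stability_rate(trials):.2f}")
print(f"flip rate under targeted challenge:   {flip_rate(trials):.2f}")
```

Reporting the two rates separately keeps the distinction the summary draws: a judge can score perfectly on the first while still being easy to reverse on the second.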


Source Excerpt

arXiv:2606.05384v1 Announce Type: new Abstract: LLM-as-judge evaluation is widely used in benchmarking pipelines, where model outputs are compared and ranked using automated evaluators. These pipelines typically assume that judgments are stable properties of fixed inputs. We show that this assumption does not hold under interaction. We study post-decision manipulability: the extent to which an evaluation outcome can be altered through subsequent conversation with the judge after an initial decision has been made.

…


