PReMISE: Policy Rubrics as Measurement Specifications for LLM Judges
Quick Take
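PReMISE treats the rubrics that condition LLM judges as measurement specifications. Given pairwise human-preference data, it discovers a policy-level rubric set and audits rubric sets along four axes. Its two repair operations address different weaknesses: preference-rank selection raises judge accuracy on paired responses from 65.0% to 68.6%, and reliability-constrained refinement cuts the rate at which exploit responses receive high scores from 46.4% to 36.0%.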
Key Points
- PReMISE audits rubric sets on four axes: structural adequacy, reliability, preference fit, and adversarial robustness.
- No raw rubric source is simultaneously reliable, preference-predictive, and adversarially robust, and high inter-rater agreement does not imply low exploitability.
- Preference-rank selection raises judge accuracy on paired responses from 65.0% to 68.6%.
- Reliability-constrained refinement reduces the rate of high-scoring exploit responses from 46.4% to 36.0%, with little change in inter-judge agreement (α = .531 to .519).
Article Content
From source RSS / original summary: arXiv:2605.30803v1. Abstract: LLM judges are increasingly used to evaluate open-ended responses, but their scores depend strongly on the rubrics that condition them. A vague rubric asking for a response to be "helpful and factual" can reward polished answers that invent facts or violate user intent. We treat reusable rubrics as measurement specifications: changing the rubric changes the response quality measurement induced by a fixed judge.
We introduce PReMISE, a framework that, given pairwise human-preference data, (i) discovers a policy-level rubric set, and (ii) audits any rubric set under LLM-judge use along four axes: structural adequacy, reliability, preference fit, and adversarial robustness. Across rubric sources, no raw source is simultaneously reliable, preference-predictive, and adversarially robust, and high inter-rater agreement does not imply low exploitability.
PReMISE is the only rubric source to score non-trivially on applicability, specificity, and effective dimensionality simultaneously. We contribute two audit-targeted repair operations: preference-rank selection raises judge accuracy on paired responses from $65.0\%$ to $68.6\%$, competitive with the strongest rubric-discovery baselines and leading on two of three judges in our cross-judge sweep; reliability-constrained refinement reduces the rate at which exploit responses receive high scores from $46.4\%$ to $36.0\%$ with little change in inter-judge agreement ($\alpha = .531 \to .519$).
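The two headline metrics in the abstract can be illustrated with a minimal Python sketch. This is not the paper's implementation: the function names, the tie-handling rule (a tie counts as a miss), and the fixed score threshold that defines a "high" score are all assumptions made for illustration, and the numbers in the usage lines are toy values, not results from the paper.

```python
from typing import Sequence, Tuple


def pairwise_accuracy(pairs: Sequence[Tuple[float, float]]) -> float:
    """Fraction of pairs where the judge scores the human-preferred response higher.

    Each pair is (score_of_preferred, score_of_rejected).
    Assumption: a tie counts as a miss.
    """
    if not pairs:
        raise ValueError("need at least one scored pair")
    correct = sum(1 for preferred, rejected in pairs if preferred > rejected)
    return correct / len(pairs)


def exploit_high_score_rate(scores: Sequence[float], threshold: float) -> float:
    """Fraction of exploit responses whose judge score is at or above `threshold`.

    Assumption: "high score" means score >= threshold; the threshold is hypothetical.
    """
    if not scores:
        raise ValueError("need at least one exploit score")
    high = sum(1 for s in scores if s >= threshold)
    return high / len(scores)


# Toy data: judge scores on a 1-5 scale (illustrative only).
toy_pairs = [(4, 2), (3, 3), (1, 5), (5, 1)]
toy_exploit_scores = [5, 4, 2, 1, 4]

accuracy = pairwise_accuracy(toy_pairs)
exploit_rate = exploit_high_score_rate(toy_exploit_scores, threshold=4)
print(f"pairwise accuracy: {accuracy:.3f}")
print(f"exploit high-score rate: {exploit_rate:.3f}")
```

Under these assumptions, a rubric repair that helps on preference fit would raise the first number, while one that helps on adversarial robustness would lower the second. The two are computed on different response sets, which is one way to see how a rubric can do well on one and poorly on the other.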