PReMISE: Policy Rubrics as Measurement Specifications for LLM Judges

arXiv cs.AI·Swastik Roy, Rajkumar Pujari, Tharindu Kumarage, Charith Peris, Rahul Gupta, Anna Rumshisky, Pradeep Natarajan, Venkatesh Saligrama

6/1/2026 · ~2 min · en

Quick Take

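PReMISE is a framework that treats policy rubrics as measurement specifications for LLM judges: given pairwise human-preference data, it discovers a policy-level rubric set and audits rubric sets for reliability, preference fit, and adversarial robustness. Its preference-rank selection raises judge accuracy on paired responses from 65.0% to 68.6%, and its reliability-constrained refinement cuts the rate at which exploit responses receive high scores from 46.4% to 36.0%.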

Key Points

PReMISE audits rubric sets along four axes: structural adequacy, reliability, preference fit, and adversarial robustness.
No raw rubric source is simultaneously reliable, preference-predictive, and adversarially robust.
Preference-rank selection raises judge accuracy on paired responses from 65.0% to 68.6% (+3.6 points).
Reliability-constrained refinement lowers the rate of high-scoring exploit responses from 46.4% to 36.0% while inter-judge agreement barely moves (α = .531 to .519).
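
The "judge accuracy on paired responses" figure can be read as a pairwise agreement rate with human preferences. The following is a minimal sketch under stated assumptions, not the paper's evaluation code: each pair holds a judge score for the human-preferred response and one for the rejected response, and ties count as misses (the summary does not say how ties are handled). The function name and the toy scores are hypothetical.

```python
from typing import Sequence, Tuple

def pairwise_accuracy(pairs: Sequence[Tuple[float, float]]) -> float:
    """Fraction of pairs where the judge scores the human-preferred response strictly higher.

    Each pair is (score_preferred, score_rejected).
    Ties count as misses, which is an assumption of this sketch.
    """
    if not pairs:
        raise ValueError("need at least one pair")
    hits = sum(1 for preferred, rejected in pairs if preferred > rejected)
    return hits / len(pairs)

# Hypothetical scores, not data from the paper.
toy_pairs = [(4.0, 2.0), (3.0, 3.0), (5.0, 1.0), (2.0, 4.0)]
print(pairwise_accuracy(toy_pairs))  # 0.5
```

Under this reading, a move from 65.0% to 68.6% means the rubric-conditioned judge sides with the human choice on roughly 3.6 more pairs per hundred.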

Article Content

From source RSS / original summary

arXiv:2605.30803v1 Announce Type: new Abstract: LLM judges are increasingly used to evaluate open-ended responses, but their scores depend strongly on the rubrics that condition them. A vague rubric asking for a response to be "helpful and factual" can reward polished answers that invent facts or violate user intent. We treat reusable rubrics as measurement specifications: changing the rubric changes the response quality measurement induced by a fixed judge.

We introduce PReMISE, a framework that, given pairwise human-preference data, (i) discovers a policy-level rubric set, and (ii) audits any rubric set under LLM-judge use along four axes: structural adequacy, reliability, preference fit, and adversarial robustness. Across rubric sources, no raw source is simultaneously reliable, preference-predictive, and adversarially robust, and high inter-rater agreement does not imply low exploitability.
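
A toy illustration of why agreement and exploitability can come apart: two judges may agree perfectly while both rewarding exploit responses. This sketch assumes a 1-5 score scale, a "high score" cutoff of 4, and exact-match agreement as a simple stand-in for whatever agreement statistic the paper uses; all names and numbers are hypothetical.

```python
from typing import Sequence

def exploit_high_score_rate(scores: Sequence[int], threshold: int = 4) -> float:
    """Share of exploit responses scored at or above `threshold` (an assumed cutoff)."""
    if not scores:
        raise ValueError("need at least one score")
    return sum(s >= threshold for s in scores) / len(scores)

def exact_agreement(a: Sequence[int], b: Sequence[int]) -> float:
    """Share of items on which two judges give the identical score."""
    if len(a) != len(b) or not a:
        raise ValueError("score lists must be non-empty and of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)

# Hypothetical 1-5 scores from two judges on the same five exploit responses.
judge_a = [5, 5, 4, 5, 2]
judge_b = [5, 5, 4, 5, 2]

print(exact_agreement(judge_a, judge_b))  # 1.0: the judges agree on every item
print(exploit_high_score_rate(judge_a))   # 0.8: yet four of five exploits score high
```

Agreement only measures whether judges move together; it says nothing about whether they move together toward the wrong answer.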

PReMISE is the only rubric source to score non-trivially on applicability, specificity, and effective dimensionality simultaneously. We contribute two audit-targeted repair operations: preference-rank selection raises judge accuracy on paired responses from $65.0\%$ to $68.6\%$, competitive with the strongest rubric-discovery baselines and leading on two of three judges in our cross-judge sweep; reliability-constrained refinement reduces the rate at which exploit responses receive high scores from $46.4\%$ to $36.0\%$ with little change in inter-judge agreement ($\alpha{=}.531\to.519$).
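
The abstract does not spell out how preference-rank selection works. One plausible reading, offered purely as an assumption, is to rank candidate rubrics by how often each one alone orders a response pair the way the human annotators did, then keep the top k. The function, rubric names, and scores below are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

# (judge score for the human-preferred response, judge score for the rejected response)
Pair = Tuple[float, float]

def rank_rubrics_by_preference(per_rubric_pairs: Dict[str, Sequence[Pair]], k: int) -> List[str]:
    """Return the k rubric names whose scores most often favor the human-preferred response.

    This is an assumed selection rule, not the procedure described in the paper.
    """
    def accuracy(pairs: Sequence[Pair]) -> float:
        if not pairs:
            raise ValueError("each rubric needs at least one scored pair")
        return sum(preferred > rejected for preferred, rejected in pairs) / len(pairs)

    ranked = sorted(
        per_rubric_pairs,
        key=lambda name: accuracy(per_rubric_pairs[name]),
        reverse=True,
    )
    return ranked[:k]

# Hypothetical per-rubric scores on three response pairs.
toy_scores = {
    "cites_sources": [(5, 2), (4, 1), (3, 4)],   # agrees with humans on 2 of 3 pairs
    "polished_tone": [(3, 4), (2, 5), (4, 3)],   # agrees on 1 of 3 pairs
    "follows_intent": [(5, 1), (4, 2), (5, 3)],  # agrees on 3 of 3 pairs
}
print(rank_rubrics_by_preference(toy_scores, k=2))  # ['follows_intent', 'cites_sources']
```

A rule like this optimizes preference fit only; the abstract's second repair, reliability-constrained refinement, targets exploitability separately.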

