Generating and Refining Dynamic Evaluation Rubrics for LLM-as-a-Judge
Quick Take
The study introduces a training-free method that generates dynamic evaluation rubrics for LLM-as-a-Judge without human annotation and performs competitively with existing methods across four benchmarks. A fine-tuned 14B rubric generator also outperforms a much larger proprietary model at rubric generation, which the authors present as evidence that their fine-tuning strategy is effective.
Key Points
- Automatically generates fine-grained evaluation rubrics without human annotation.
- Achieves competitive performance across four benchmarks compared to existing methods.
- Fine-tuned 14B rubric generator outperforms a much larger proprietary model at rubric generation.
- Introduces iterative fine-tuning via meta-judge reward signals.
- Builds on LLM-as-a-Judge as a scalable alternative to human evaluation.
Article Excerpt
From source RSS / original summary: arXiv:2605.30568v1 (announce type: new). Abstract: LLM-as-a-Judge is a scalable alternative to human evaluation, yet existing rubric-based methods rely on human-annotated data such as reference answers or expert-crafted rubrics. We propose to automatically generate fine-grained evaluation rubrics without any human annotation. Our training-free method generates rubrics at dataset-specific and instance-specific granularities, achieving performance competitive with existing methods across four benchmarks.
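The two granularities mentioned above can be pictured with a minimal Python sketch. It is an illustration only, not the paper's implementation: the `LLM` callable, the prompt wording, and the helper names `parse_rubric`, `dataset_rubric`, and `instance_rubric` are all hypothetical, and the abstract does not describe the actual prompts used.

```python
from typing import Callable, Sequence

# Hypothetical interface: any function that maps a prompt string to generated text.
LLM = Callable[[str], str]


def parse_rubric(text: str) -> list[str]:
    """Turn generated text into a list of criteria, one per non-empty line."""
    criteria = []
    for line in text.splitlines():
        item = line.strip().lstrip("-* ").strip()  # drop simple bullet markers
        if item:
            criteria.append(item)
    return criteria


def dataset_rubric(llm: LLM, sample_prompts: Sequence[str]) -> list[str]:
    """Dataset-specific granularity: one rubric shared by all instances."""
    joined = "\n".join(f"- {p}" for p in sample_prompts)
    request = (
        "Here are example tasks from one dataset:\n"
        f"{joined}\n"
        "List evaluation criteria that apply to all of them, one per line."
    )
    return parse_rubric(llm(request))


def instance_rubric(llm: LLM, prompt: str) -> list[str]:
    """Instance-specific granularity: a fresh rubric for a single task."""
    request = (
        "Task:\n"
        f"{prompt}\n"
        "List evaluation criteria specific to this task, one per line."
    )
    return parse_rubric(llm(request))
```

Under these assumptions, the dataset-level rubric is generated once and reused, while the instance-level rubric costs one extra generation call per task in exchange for criteria tailored to that task.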
We further present a method that iteratively fine-tunes a rubric generator model via meta-judge reward signals. The fine-tuned generator outperforms all existing baselines in both pairwise and pointwise evaluation. Notably, a fine-tuned 14B rubric generator outperforms a much larger proprietary model at rubric generation, showing the effectiveness of our fine-tuning strategy.
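The abstract does not say how meta-judge rewards are turned into training data. One common pattern, shown here purely as an assumption, is best-of-n selection: sample several candidate rubrics per task, score each with the meta-judge, and keep the highest-scoring one as a fine-tuning target for the next round. The `Generator` and `MetaJudge` callables and the `collect_round` function are hypothetical names introduced for this sketch.

```python
from typing import Callable, Sequence

# Hypothetical interfaces, not taken from the paper:
Generator = Callable[[str], str]         # task prompt -> candidate rubric text
MetaJudge = Callable[[str, str], float]  # (task prompt, rubric) -> reward


def collect_round(
    generator: Generator,
    meta_judge: MetaJudge,
    prompts: Sequence[str],
    samples_per_prompt: int = 4,
) -> list[tuple[str, str]]:
    """Collect (prompt, best rubric) pairs for one fine-tuning round.

    This is best-of-n selection, assumed here for illustration; the
    paper's actual use of the reward signal may differ.
    """
    training_pairs = []
    for prompt in prompts:
        # Sample several candidate rubrics from the current generator.
        candidates = [generator(prompt) for _ in range(samples_per_prompt)]
        # Keep the candidate the meta-judge rewards most.
        best = max(candidates, key=lambda rubric: meta_judge(prompt, rubric))
        training_pairs.append((prompt, best))
    return training_pairs
```

In an iterative setup, the pairs returned by one round would be used to fine-tune the generator, and the updated generator would then be passed back in for the next round.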
More from arXiv cs.CL
Time to REFLECT: Can We Trust LLM Judges for Evidence-based Research Agents?
The REFLECT benchmark reveals that current LLM judges are unreliable, achieving below 55% accuracy in evaluating reasoning and evidence use, highlighting the need for improved evaluation methods for deep research agents.