Generating and Refining Dynamic Evaluation Rubrics for LLM-as-a-Judge

arXiv cs.CL·Zijie Wang, Eduardo Blanco

6/1/2026

Quick Answer

The study introduces a training-free method for generating dynamic evaluation rubrics for LLM-as-a-Judge, achieving competitive performance across four benchmarks.

Quick Take

Beyond the training-free method, the authors iteratively fine-tune a rubric generator using meta-judge reward signals. The fine-tuned generator outperforms all existing baselines in both pairwise and pointwise evaluation, and a fine-tuned 14B rubric generator outperforms a much larger proprietary model at rubric generation.

Key Points

Automatically generates fine-grained evaluation rubrics without human annotation.
Training-free variant achieves performance competitive with existing methods across four benchmarks.
Fine-tuned 14B rubric generator outperforms a much larger proprietary model at rubric generation.
Introduces iterative fine-tuning of the rubric generator via meta-judge reward signals.
Positions rubric-based LLM-as-a-Judge as a scalable alternative to human evaluation.
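
To make the pointwise and pairwise settings concrete, here is a minimal sketch of how a rubric could drive both. Everything in it is an assumption for illustration, not the paper's implementation: the `Criterion` type, the weighted-mean aggregation, and the keyword-matching `toy_judge` that stands in for an LLM judge are all hypothetical. A rubric generated at dataset-specific or instance-specific granularity would have the same shape here.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Criterion:
    """One rubric item (hypothetical structure, not taken from the paper)."""
    description: str
    weight: float = 1.0


# A judge maps (criterion text, response) to a score in [0, 1].
Judge = Callable[[str, str], float]


def pointwise_score(rubric: List[Criterion], response: str, judge: Judge) -> float:
    """Weighted mean of the per-criterion scores (an assumed aggregation)."""
    total_weight = sum(c.weight for c in rubric)
    if total_weight <= 0:
        raise ValueError("rubric needs a positive total weight")
    weighted = sum(c.weight * judge(c.description, response) for c in rubric)
    return weighted / total_weight


def pairwise_preference(
    rubric: List[Criterion], response_a: str, response_b: str, judge: Judge
) -> str:
    """Compare two responses by their pointwise scores under the same rubric."""
    score_a = pointwise_score(rubric, response_a, judge)
    score_b = pointwise_score(rubric, response_b, judge)
    if score_a > score_b:
        return "A"
    if score_b > score_a:
        return "B"
    return "tie"


def toy_judge(criterion: str, response: str) -> float:
    """Stand-in for an LLM judge: checks for the criterion's last word."""
    keyword = criterion.split()[-1].lower()
    return 1.0 if keyword in response.lower() else 0.0


rubric = [Criterion("states the formula", 2.0), Criterion("mentions units", 1.0)]
full = "The formula is F = ma, with units of newtons."
partial = "Measured in units of newtons."
bare = "Force is mass times acceleration."

print(pointwise_score(rubric, full, toy_judge))
print(pairwise_preference(rubric, partial, bare, toy_judge))
```

In practice the judge callable would wrap a call to a language model; keeping it as a plain function argument lets the aggregation logic be tested without one.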

[Submitted on 28 May 2026]

Abstract: LLM-as-a-Judge is a scalable alternative to human evaluation, yet existing rubric-based methods rely on human-annotated data such as reference answers or expert-crafted rubrics. We propose to automatically generate fine-grained evaluation rubrics without any human annotation. Our training-free method generates rubrics at dataset-specific and instance-specific granularities, achieving performance competitive with existing methods across four benchmarks. We further present a method that iteratively fine-tunes a rubric generator model via meta-judge reward signals. The fine-tuned generator outperforms all existing baselines in both pairwise and pointwise evaluation. Notably, a fine-tuned 14B rubric generator outperforms a much larger proprietary model at rubric generation, showing the effectiveness of our fine-tuning strategy.
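
The abstract does not say how the meta-judge reward is turned into a training signal (rejection sampling, preference optimization, and reinforcement learning are all plausible). As one possible reading, the sketch below assumes best-of-n selection: sample several candidate rubrics per prompt, keep the one the meta-judge rewards most, and pass the kept pairs to a fine-tuning step. The names `select_training_pairs`, `run_rounds`, `fine_tune`, and the `min_reward` threshold are hypothetical, and the toy generator and toy meta-judge exist only so the example runs.

```python
from typing import Callable, List, Tuple

# (prompt, n) -> up to n candidate rubrics, each as plain text
RubricGenerator = Callable[[str, int], List[str]]
# (prompt, rubric) -> scalar reward; higher is better
MetaJudge = Callable[[str, str], float]
# (current generator, selected pairs) -> updated generator
FineTune = Callable[[RubricGenerator, List[Tuple[str, str]]], RubricGenerator]


def select_training_pairs(
    prompts: List[str],
    generate: RubricGenerator,
    meta_judge: MetaJudge,
    n_candidates: int = 4,
    min_reward: float = 0.5,
) -> List[Tuple[str, str]]:
    """Keep the highest-reward rubric per prompt (assumed best-of-n selection)."""
    selected = []
    for prompt in prompts:
        candidates = generate(prompt, n_candidates)
        if not candidates:
            continue
        scored = [(meta_judge(prompt, rubric), rubric) for rubric in candidates]
        best_reward, best_rubric = max(scored, key=lambda pair: pair[0])
        if best_reward >= min_reward:
            selected.append((prompt, best_rubric))
    return selected


def run_rounds(
    prompts: List[str],
    generate: RubricGenerator,
    meta_judge: MetaJudge,
    fine_tune: FineTune,
    rounds: int = 2,
) -> RubricGenerator:
    """Alternate between selecting rubrics and updating the generator."""
    for _ in range(rounds):
        pairs = select_training_pairs(prompts, generate, meta_judge)
        generate = fine_tune(generate, pairs)
    return generate


def toy_generate(prompt: str, n: int) -> List[str]:
    """Canned candidates standing in for samples from a rubric generator."""
    candidates = [
        "Is the answer correct?",
        "Is the answer correct?\nDoes it show its reasoning?",
        "",
    ]
    return candidates[:n]


def toy_meta_judge(prompt: str, rubric: str) -> float:
    """Toy reward: fraction of two expected criteria lines that are present."""
    lines = [line for line in rubric.splitlines() if line.strip()]
    return min(len(lines), 2) / 2


prompts = ["Explain Newton's second law."]
pairs = select_training_pairs(prompts, toy_generate, toy_meta_judge, n_candidates=3)
print(pairs)
```

A real `fine_tune` would train the generator on the selected pairs; here it is left as a caller-supplied function so the selection logic stays independent of any training framework.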

Subjects:	Computation and Language (cs.CL)
Cite as:	arXiv:2605.30568 [cs.CL]
	(or arXiv:2605.30568v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2605.30568 arXiv-issued DOI via DataCite

Submission history

From: Zijie Wang
[v1] Thu, 28 May 2026 20:59:45 UTC (54 KB)

— Originally published at arxiv.org
