PoQ-Judge: A Multi-Architecture Evaluation Framework for Cost-Aware Proof-of-Quality in Decentralized LLM Inference
Quick Answer
PoQ-Judge introduces a reference-free evaluation framework for decentralized LLM inference, achieving a 0.747 Pearson correlation with ground-truth proxies using a DeBERTa judge.
Quick Take
Cascade evaluation cuts evaluation cost by 72.7% with only modest quality loss, and the best judge outperforms the reference-based evaluators from prior work. Used as a reference-free component in composite scoring, it reaches 0.645 Pearson correlation, matching the best single reference-based evaluator without needing reference answers.
Key Points
- PoQ-Judge trains judge models for scoring query-output pairs without ground-truth references.
- The DeBERTa judge model achieved the highest Pearson correlation of 0.747.
- Online calibration identifies semantic quality as the main evaluation dimension.
- Cascade evaluation reduces cost by 72.7% with only modest quality loss.
- Results are much stronger on QA than on summarization, which points to proxy quality as the main remaining limitation.
Article Excerpt
arXiv:2606.11196v1. Abstract: Decentralized LLM inference networks need lightweight, reference-free quality evaluation for Proof of Quality (PoQ). We present PoQ-Judge, a framework that trains dedicated judge models to score query-output pairs without ground-truth references. We study three architectures across the quality-cost tradeoff: a TextCNN judge, a MiniLM cross-encoder, and a DeBERTa judge.
Using two-stage training on UltraFeedback plus GPT-labeled in-domain data, the best model reaches 0.747 Pearson correlation with the ground-truth proxy on a held-out test set, outperforming reference-based evaluators from prior work. As a reference-free component in composite scoring, it achieves 0.645 Pearson correlation, matching the best single reference-based evaluator while removing the need for reference answers.
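The headline metric above is a Pearson correlation between judge scores and a ground-truth proxy. As a minimal sketch of how such a number is computed, the following uses only the standard library. The function name `pearson` and the toy score lists are illustrative assumptions, not data or code from the paper.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Covariance and variances, left unnormalized because the 1/n factors cancel.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)

# Hypothetical toy scores: one judge score and one proxy score per query-output pair.
judge_scores = [0.2, 0.5, 0.9, 0.4, 0.7]
proxy_scores = [0.1, 0.6, 0.8, 0.5, 0.6]
r = pearson(judge_scores, proxy_scores)
print(f"Pearson r = {r:.3f}")
```

A value near 1 means the judge ranks and spaces outputs much as the proxy does. Because the target is a proxy rather than true quality, a high correlation is only as meaningful as the proxy itself, which is the limitation the abstract raises for summarization.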
We also show that online calibration identifies semantic quality as the dominant dimension and that cascade evaluation reduces cost by 72.7 percent with only modest quality loss. Results are much stronger on QA than summarization, pointing to proxy quality as the main remaining limitation.
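The excerpt does not describe how the cascade routes examples, so the following is only a sketch of one common pattern: run a cheap judge first and escalate to a stronger judge when the cheap score falls in an uncertain middle band. The `Judge` class, the thresholds `low` and `high`, the relative costs, and both toy scoring functions are assumptions made for illustration. The resulting saving is not the paper's 72.7% figure.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass
class Judge:
    name: str
    cost: float                         # hypothetical relative cost per call
    score: Callable[[str, str], float]  # (query, output) -> score in [0, 1]

def cascade_score(query: str, output: str, cheap: Judge, strong: Judge,
                  low: float = 0.3, high: float = 0.7) -> Tuple[float, float]:
    """Return (score, cost). Keep the cheap score when it is clearly low or
    clearly high; otherwise escalate and pay for both judges."""
    s = cheap.score(query, output)
    if s <= low or s >= high:
        return s, cheap.cost
    return strong.score(query, output), cheap.cost + strong.cost

def cost_reduction(pairs: List[Tuple[str, str]], cheap: Judge, strong: Judge) -> float:
    """Fraction of cost saved versus running the strong judge on every pair."""
    cascade_cost = sum(cascade_score(q, o, cheap, strong)[1] for q, o in pairs)
    baseline_cost = strong.cost * len(pairs)
    return 1.0 - cascade_cost / baseline_cost

# Toy stand-ins for real models (assumptions, not the paper's judges).
def length_heuristic(query: str, output: str) -> float:
    return min(len(output.split()) / 10, 1.0)

def token_overlap(query: str, output: str) -> float:
    q_tokens = set(query.lower().split())
    o_tokens = set(output.lower().split())
    return len(q_tokens & o_tokens) / len(q_tokens) if q_tokens else 0.0

cheap_judge = Judge("cheap", cost=1.0, score=length_heuristic)
strong_judge = Judge("strong", cost=10.0, score=token_overlap)

pairs = [
    ("what is the capital", "a"),                         # cheap score 0.1: accepted
    ("what is the capital", "one two three four five"),   # cheap score 0.5: escalated
    ("what is the capital", "a b c d e f g h i j"),       # cheap score 1.0: accepted
]
saving = cost_reduction(pairs, cheap_judge, strong_judge)
print(f"cost reduction: {saving:.1%}")
```

The width of the band sets the tradeoff: a wider band escalates more pairs, which raises cost and tracks the strong judge more closely, while a narrower band saves more but trusts the cheap judge on harder cases.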