Measuring Judgment Quality in Natural-Language Explanations: Evidence from Forecasting Tournaments

arXiv cs.CL · Christopher W. Karvetski, Sheldon S. Huang, Simas Kučinskas, Nadja Flechner, Jingyu Hu, Philip Tetlock, Ezra Karger

7/1/2026

Quick Answer

The study introduces Explanation Quality Markers (EQMs), a set of sixty theory-guided reasoning patterns scored by large language models (LLMs), which predict forecasting accuracy better than pre-LLM text-analysis methods.

Quick Take

In a pre-registered analysis of over 55,000 forecast-rationale pairs from a multiyear forecasting tournament, EQMs outperform pre-LLM text-analysis techniques and offer a scalable, interpretable way to assess judgment quality in natural-language explanations.

Key Points

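EQMs predict accuracy at both the forecast and forecaster levels, consistently outperforming pre-LLM text-analysis methods.
More than 90% of statistically significant pattern-level EQM-accuracy correlations match the authors' directional hypotheses.
The signal is asymmetric: EQMs identify likely underperformers more reliably than they distinguish the very best forecasters.
Against traditional indicators of forecasting skill, EQMs are the strongest predictor at the forecast level and competitive at the forecaster level, though weaker than prior accuracy.
Human ratings of rationale quality correlate less consistently with accuracy and place disproportionate weight on rationale length.
Results transfer to an independent forecasting study.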

Article Content

From source RSS / original summary

arXiv:2606.30987v1 Announce Type: new

Abstract: Decision-makers routinely rely on expert judgments accompanied by written explanations, yet explanation quality is difficult to measure at scale. Forecasting tournaments offer a natural testing ground: probabilistic judgments are paired with natural-language rationales and scored against realized outcomes. We introduce Explanation Quality Markers (EQMs), a set of sixty theory-guided reasoning patterns scored by large language models (LLMs).

In a pre-registered analysis of over 55,000 forecast-rationale pairs from a multiyear forecasting tournament, EQMs predict accuracy at both the forecast and forecaster levels, consistently outperforming pre-LLM text-analysis methods. More than 90% of statistically significant pattern-level EQM-accuracy correlations match our directional hypotheses. The signal is asymmetric: EQMs identify likely underperformers more reliably than they distinguish the very best forecasters.
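
The two levels of analysis mentioned above can be illustrated with a minimal sketch. It is not the authors' pipeline: the abstract does not name the scoring rule or the statistical model, so the Brier score, the Pearson correlation, the `rows` data, and the single `eqm_score` per rationale are all assumptions made for illustration only.

```python
from collections import defaultdict
from math import sqrt


def brier(prob, outcome):
    """Squared error of a binary probability forecast (lower is better).

    Assumption: the abstract does not say which scoring rule the paper uses.
    """
    return (prob - outcome) ** 2


def pearson(xs, ys):
    """Plain Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Hypothetical rows: (forecaster_id, probability, realized_outcome, eqm_score).
# eqm_score stands in for an LLM-assigned composite of the sixty patterns.
rows = [
    ("a", 0.9, 1, 0.8),
    ("a", 0.2, 0, 0.7),
    ("b", 0.6, 0, 0.3),
    ("b", 0.4, 1, 0.2),
    ("c", 0.7, 1, 0.6),
    ("c", 0.5, 0, 0.4),
]

# Forecast level: one (eqm_score, error) pair per forecast-rationale pair.
eqm = [r[3] for r in rows]
err = [brier(r[1], r[2]) for r in rows]
forecast_level_r = pearson(eqm, err)

# Forecaster level: average both quantities within each forecaster first.
by_forecaster = defaultdict(list)
for fid, p, y, s in rows:
    by_forecaster[fid].append((s, brier(p, y)))
mean_eqm = [sum(s for s, _ in v) / len(v) for v in by_forecaster.values()]
mean_err = [sum(e for _, e in v) / len(v) for v in by_forecaster.values()]
forecaster_level_r = pearson(mean_eqm, mean_err)

print(forecast_level_r, forecaster_level_r)
```

In this toy data a higher `eqm_score` goes with lower error, so both correlations come out negative. The numbers carry no evidential weight; they only show how a forecast-level comparison differs from a forecaster-level one.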

Benchmarked against traditional indicators of forecasting skill, EQMs are the strongest predictor at the forecast level and competitive at the forecaster level, though weaker than prior accuracy. Human ratings of rationale quality are less consistently correlated with accuracy and place disproportionate weight on rationale length. Results transfer to an independent forecasting study. EQMs provide a scalable, interpretable method for extracting judgment-relevant information from written explanations.
