ConflictScore: Identifying and Measuring How Language Models Handle Conflicting Evidence
Quick Answer
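ConflictScore is a metric that quantifies how well a language model's response acknowledges conflicting evidence in its grounding documents. It measures both how many claims exhibit conflicts and how evenly supporting and contradicting evidence is balanced.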
Quick Take
ConflictScore decomposes a response into atomic claims, labels each claim against each grounding document, and aggregates the labels into two measures: ConflictScore-Count (CS-C), the proportion of claims exhibiting conflicts, and ConflictScore-Ratio (CS-R), the balance between supporting and contradicting evidence. The accompanying ConflictBench benchmark covers conflict types such as ambiguity, contradiction, and divergent opinions. In the reported experiments, ConflictScore detects overconfident claims across domains and, when used as corrective feedback, improves truthfulness on TruthfulQA.
Key Points
- ConflictScore quantifies model responses' acknowledgment of conflicting evidence.
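- Two measures: ConflictScore-Count (CS-C), the proportion of claims exhibiting conflicts, and ConflictScore-Ratio (CS-R), the balance between supporting and contradicting evidence.
- The ConflictBench benchmark covers diverse forms of conflict, such as ambiguity, contradiction, and divergent opinions.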
- Experiments show ConflictScore detects overconfident claims across various domains.
- Used as a corrective feedback mechanism, ConflictScore improves truthfulness on TruthfulQA.
Article Excerpt
From source RSS / original summary
arXiv:2606.26437v1 Announce Type: new
Abstract: Existing metrics for factuality and faithfulness evaluate whether an answer is supported or contradicted by its grounding documents, but they fail to capture when both supporting and contradicting evidence coexist. We introduce ConflictScore, a novel metric that quantifies how well a model's response acknowledges conflicting evidence in its grounding documents.
Our framework decomposes responses into atomic claims, labels each claim against each grounding document, and then aggregates these labels into two complementary measures: ConflictScore-Count (CS-C), the proportion of claims exhibiting conflicts, and ConflictScore-Ratio (CS-R), the balance between supporting and contradicting evidence. We develop ConflictBench, a benchmark covering diverse forms of conflicts such as ambiguity, contradiction, and divergent opinions, to systematically evaluate our metric.
Experiments show that ConflictScore effectively detects overconfident claims across domains and can serve as a corrective feedback mechanism that improves truthfulness on TruthfulQA.
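The aggregation step described in the abstract can be illustrated with a minimal Python sketch. The abstract does not give exact formulas, so everything below is an assumption made for illustration: the label names, the function name `conflict_scores`, the rule that a claim "exhibits a conflict" when at least one document supports it and at least one contradicts it, and the min/max balance used for CS-R. Claim decomposition and per-document labeling are taken as already done; the sketch only aggregates a claim-by-document label matrix.

```python
from collections import Counter
from typing import Dict, List

# Hypothetical label names; the paper's actual label set may differ.
SUPPORT, CONTRADICT, NEUTRAL = "support", "contradict", "neutral"


def conflict_scores(labels: List[List[str]]) -> Dict[str, float]:
    """Aggregate per-claim, per-document labels into two scores.

    labels[i][j] is the label of claim i against grounding document j.
    Assumed definitions (not taken from the paper):
      cs_c: fraction of claims with both supporting and contradicting documents
      cs_r: mean of min(s, c) / max(s, c) over those conflicted claims,
            where s and c count supporting and contradicting documents
    """
    if not labels:
        return {"cs_c": 0.0, "cs_r": 0.0}

    conflicted = 0
    balance_sum = 0.0
    for claim_labels in labels:
        counts = Counter(claim_labels)
        s, c = counts[SUPPORT], counts[CONTRADICT]
        if s > 0 and c > 0:
            conflicted += 1
            # 1.0 means evenly split evidence; values near 0 mean one side dominates.
            balance_sum += min(s, c) / max(s, c)

    cs_c = conflicted / len(labels)
    cs_r = balance_sum / conflicted if conflicted else 0.0
    return {"cs_c": cs_c, "cs_r": cs_r}


# Four claims labeled against three grounding documents (made-up labels).
example = [
    [SUPPORT, SUPPORT, NEUTRAL],        # supported only, no conflict
    [SUPPORT, CONTRADICT, CONTRADICT],  # conflict, balance 1/2
    [SUPPORT, CONTRADICT, NEUTRAL],     # conflict, balance 1/1
    [NEUTRAL, NEUTRAL, NEUTRAL],        # no evidence either way
]
scores = conflict_scores(example)
print(scores)
```

Under these assumed definitions, two of the four example claims are conflicted, so `cs_c` is 0.5, and their balances of 0.5 and 1.0 average to a `cs_r` of 0.75. A different balance function, or a different conflict criterion, would change both numbers.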