SKG-Eval: Stateful Evaluation of Multi-Turn Dialogue via Incremental Semantic Knowledge Graphs

arXiv cs.CL·Avijit Shil, Suman Samui

5/19/2026

Quick Answer

SKG-Eval introduces a quasi-deterministic framework for evaluating multi-turn dialogue systems using Semantic Knowledge Graphs, enhancing detection of long-range inconsistencies.

Quick Take

SKG-Eval tracks dialogue state in an incrementally updated Semantic Knowledge Graph rather than in flat, turn-isolated representations. Across multiple benchmarks it correlates more closely with human judgments and substantially improves detection of long-range inconsistencies. It also produces explicit contradiction certificates and deterministic scores for fixed inputs, which makes evaluations reproducible and auditable.

Key Points

SKG-Eval models dialogue as an evolving Semantic Knowledge Graph.
It computes local relevance, historical consistency, and logical coherence signals.
The framework improves detection of contradictions and topic drift in conversations.
SKG-Eval achieves higher correlation with human judgments than existing evaluators.
It enables reproducible evaluations with explicit contradiction certificates.
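
To make the first and last points concrete, here is a minimal Python sketch of a dialogue graph that grows turn by turn and returns a pair of conflicting triples as a "certificate". It is an illustration under stated assumptions, not the paper's method: the names `Triple`, `DialogueGraph`, and `FUNCTIONAL_RELATIONS` are hypothetical, triple extraction is assumed to have happened upstream, and a simple single-valued-relation check stands in for the geometric contradiction engine the paper describes.

```python
from dataclasses import dataclass, field

# Assumption: relations listed here accept only one object per subject.
FUNCTIONAL_RELATIONS = {"lives_in", "born_in"}


@dataclass(frozen=True)
class Triple:
    subject: str
    relation: str
    obj: str
    turn: int  # turn index at which the claim was made


@dataclass
class DialogueGraph:
    triples: list = field(default_factory=list)

    def add_turn(self, turn, extracted):
        """Add (subject, relation, object) tuples for one turn.

        Returns a list of (earlier_triple, new_triple) pairs that conflict,
        which serve as human-readable contradiction certificates.
        """
        certificates = []
        for subject, relation, obj in extracted:
            new = Triple(subject, relation, obj, turn)
            if relation in FUNCTIONAL_RELATIONS:
                for old in self.triples:
                    same_slot = (old.subject, old.relation) == (subject, relation)
                    if same_slot and old.obj != obj:
                        certificates.append((old, new))
            self.triples.append(new)
        return certificates


graph = DialogueGraph()
graph.add_turn(1, [("user", "lives_in", "Paris")])
graph.add_turn(4, [("user", "likes", "jazz")])
conflicts = graph.add_turn(7, [("user", "lives_in", "Berlin")])
for old, new in conflicts:
    print(f"turn {new.turn} contradicts turn {old.turn}: {old.obj} vs {new.obj}")
```

Because the graph is explicit state, the same sequence of extracted triples always yields the same certificates, which is the property that makes this style of evaluation reproducible.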

[Submitted on 15 May 2026]

Abstract: Evaluating multi-turn dialogue systems remains challenging because response quality depends not only on the current prompt, but also on previously established entities, claims, and conversational commitments. Existing automatic evaluators, including LLM-as-a-judge frameworks and embedding-based metrics, largely rely on flat or turn-isolated representations, making them less effective at detecting long-range issues such as contradiction, topic drift, and entity inconsistency. To address this, we propose SKG-Eval, a quasi-deterministic and interpretable framework that models dialogue as an evolving Semantic Knowledge Graph (SKG) of entities, relations, and commitments across turns. The framework incrementally updates the graph through structured triple extraction and computes three complementary signals: (i) local relevance, measuring alignment with the current prompt and optional reference; (ii) historical consistency, evaluating how newly introduced information connects to prior conversational context using graph-based and embedding-driven signals; and (iii) logical coherence, assessed by a geometric contradiction engine that detects cross-turn conflicts without relying on NLI models or LLM judges. These signals are adaptively fused and aggregated into a length-invariant session score via recency-weighted trend analysis. Across multiple benchmarks, SKG-Eval achieves higher correlation with human judgments and substantially improves detection of long-range inconsistencies in extended conversations. In addition, the framework produces explicit contradiction certificates and deterministic scores for fixed inputs, enabling reproducible and auditable evaluation. Overall, our results suggest that structured externalized state tracking through semantic knowledge graphs provides a scalable alternative to implicit reasoning in LLM-based dialogue evaluators.
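
The abstract does not give the fusion or aggregation formulas, so the following Python sketch only illustrates the general idea of turning three per-turn signals into a session score that does not grow or shrink with conversation length. Everything in it is an assumption: the function names `fuse` and `session_score`, the fixed fusion weights (the paper describes adaptive fusion), and the exponential `decay` used for recency weighting.

```python
def fuse(relevance, consistency, coherence, weights=(1 / 3, 1 / 3, 1 / 3)):
    """Combine three per-turn signals in [0, 1] with fixed weights.

    Assumption: fixed weights stand in for the paper's adaptive fusion.
    """
    w_rel, w_con, w_coh = weights
    return w_rel * relevance + w_con * consistency + w_coh * coherence


def session_score(turn_scores, decay=0.8):
    """Recency-weighted mean of per-turn scores.

    The most recent turn gets weight 1, the one before it `decay`, and so on.
    Dividing by the weight sum keeps the result on the same scale as the
    inputs regardless of how many turns the session has.
    """
    if not turn_scores:
        raise ValueError("turn_scores must not be empty")
    n = len(turn_scores)
    weights = [decay ** (n - 1 - i) for i in range(n)]
    weighted_sum = sum(w * s for w, s in zip(weights, turn_scores))
    return weighted_sum / sum(weights)


per_turn = [fuse(0.9, 0.8, 1.0), fuse(0.7, 0.9, 1.0), fuse(0.8, 0.4, 0.0)]
print(round(session_score(per_turn), 3))
```

With this weighting, a session whose quality drops in its final turns scores lower than one with the same drop early on, and a constant per-turn score yields the same session score for 3 turns or 30.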

Comments:	36 Pages, 6 Figures
Subjects:	Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Cite as:	arXiv:2605.16650 [cs.CL]
	(or arXiv:2605.16650v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2605.16650 arXiv-issued DOI via DataCite (pending registration)

Submission history

From: Suman Samui
[v1] Fri, 15 May 2026 21:39:48 UTC (1,878 KB)

— Originally published at arxiv.org
