SKG-Eval: Stateful Evaluation of Multi-Turn Dialogue via Incremental Semantic Knowledge Graphs
Quick Take
SKG-Eval introduces a quasi-deterministic framework for evaluating multi-turn dialogue systems using Semantic Knowledge Graphs, enhancing detection of long-range inconsistencies. It achieves higher correlation with human judgments across benchmarks and provides explicit contradiction certificates, enabling reproducible evaluations.
Key Points
- SKG-Eval models dialogue as an evolving Semantic Knowledge Graph.
- It computes local relevance, historical consistency, and logical coherence signals.
- The framework improves detection of contradictions and topic drift in conversations.
- SKG-Eval achieves higher correlation with human judgments than existing evaluators.
- It enables reproducible evaluations with explicit contradiction certificates.
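To make the first and last points concrete, here is a minimal Python sketch of an incrementally updated dialogue graph that returns a contradiction certificate when a later turn conflicts with an earlier one. It is an illustration only, not the paper's method: the paper describes a geometric contradiction engine, whereas this sketch swaps in a much simpler rule (two different objects for a single-valued relation). The names `Triple`, `DialogueGraph`, `add_turn`, and `functional_relations` are hypothetical, and triple extraction is assumed to have already happened upstream.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Triple:
    """One extracted fact, tagged with the turn that introduced it."""
    subject: str
    relation: str
    obj: str
    turn: int


@dataclass
class DialogueGraph:
    # Assumption for this sketch: these relations admit only one object per subject.
    functional_relations: frozenset = frozenset({"lives_in", "age"})
    triples: list = field(default_factory=list)

    def add_turn(self, turn, extracted):
        """Add (subject, relation, object) tuples from one turn.

        Returns a list of (earlier_triple, new_triple) pairs, each acting as
        an explicit certificate of a cross-turn conflict.
        """
        certificates = []
        for subject, relation, obj in extracted:
            new = Triple(subject, relation, obj, turn)
            if relation in self.functional_relations:
                for old in self.triples:
                    same_slot = (old.subject, old.relation) == (subject, relation)
                    if same_slot and old.obj != obj:
                        certificates.append((old, new))
            self.triples.append(new)
        return certificates


graph = DialogueGraph()
graph.add_turn(1, [("user", "lives_in", "Paris")])
graph.add_turn(2, [("user", "likes", "jazz")])
conflicts = graph.add_turn(7, [("user", "lives_in", "Berlin")])
for earlier, later in conflicts:
    print(f"turn {earlier.turn}: {earlier.obj} vs turn {later.turn}: {later.obj}")
```

Because the check is a plain lookup over stored triples, the same input always yields the same certificates, which is the property the reproducibility point relies on.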
Abstract: Evaluating multi-turn dialogue systems remains challenging because response quality depends not only on the current prompt, but also on previously established entities, claims, and conversational commitments. Existing automatic evaluators, including LLM-as-a-judge frameworks and embedding-based metrics, largely rely on flat or turn-isolated representations, making them less effective at detecting long-range issues such as contradiction, topic drift, and entity inconsistency. To address this, we propose SKG-Eval, a quasi-deterministic and interpretable framework that models dialogue as an evolving Semantic Knowledge Graph (SKG) of entities, relations, and commitments across turns. The framework incrementally updates the graph through structured triple extraction and computes three complementary signals: (i) local relevance, measuring alignment with the current prompt and optional reference; (ii) historical consistency, evaluating how newly introduced information connects to prior conversational context using graph-based and embedding-driven signals; and (iii) logical coherence, assessed by a geometric contradiction engine that detects cross-turn conflicts without relying on NLI models or LLM judges. These signals are adaptively fused and aggregated into a length-invariant session score via recency-weighted trend analysis. Across multiple benchmarks, SKG-Eval achieves higher correlation with human judgments and substantially improves detection of long-range inconsistencies in extended conversations. In addition, the framework produces explicit contradiction certificates and deterministic scores for fixed inputs, enabling reproducible and auditable evaluation. Overall, our results suggest that structured externalized state tracking through semantic knowledge graphs provides a scalable alternative to implicit reasoning in LLM-based dialogue evaluators.
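The fusion and aggregation step in the abstract can be pictured with a short Python sketch. This is a simplified stand-in under stated assumptions, not the paper's formulation: the abstract says the signals are fused adaptively and aggregated by recency-weighted trend analysis, while the sketch uses fixed fusion weights and an exponentially decayed weighted mean. The function names `fuse` and `session_score` and the `decay` parameter are hypothetical.

```python
def fuse(relevance, consistency, coherence, weights=(1.0, 1.0, 1.0)):
    """Combine the three per-turn signals into one score.

    Assumption: fixed weights, normalized to sum to 1. The paper's fusion
    is described as adaptive; that logic is not reproduced here.
    """
    total = sum(weights)
    signals = (relevance, consistency, coherence)
    return sum(w * s for w, s in zip(weights, signals)) / total


def session_score(turn_scores, decay=0.8):
    """Recency-weighted mean of per-turn scores.

    The most recent turn gets weight 1, the one before it `decay`, then
    `decay**2`, and so on. Dividing by the weight sum keeps the result on
    the same scale regardless of how many turns the session has.
    """
    if not turn_scores:
        raise ValueError("session has no scored turns")
    n = len(turn_scores)
    weights = [decay ** (n - 1 - i) for i in range(n)]
    return sum(w * s for w, s in zip(weights, turn_scores)) / sum(weights)


per_turn = [fuse(0.9, 0.8, 1.0), fuse(0.7, 0.6, 0.2)]
print(round(session_score(per_turn), 3))
```

Normalizing by the weight sum is what makes this toy version length-invariant: a session with a constant per-turn score gets that same score whether it has three turns or thirty, while a late drop in quality lowers the result more than an early one.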
| Comments: | 36 Pages, 6 Figures |
| Subjects: | Computation and Language (cs.CL); Artificial Intelligence (cs.AI) |
| Cite as: | arXiv:2605.16650 [cs.CL] |
| | (or arXiv:2605.16650v1 [cs.CL] for this version) |
| | https://doi.org/10.48550/arXiv.2605.16650 (arXiv-issued DOI via DataCite, pending registration) |
Submission history
From: Suman Samui
[v1] Fri, 15 May 2026 21:39:48 UTC (1,878 KB)
— Originally published at arxiv.org