Quantifying Consistency in LLM Logical Reasoning via Structural Uncertainty
Quick Answer
This study introduces structural uncertainty as a framework to evaluate logical reasoning consistency in large language models (LLMs).
Quick Take
Using pairwise self-preferences among candidate solutions, evaluated across five LLMs and eight benchmarks, the study finds that within-trial ambiguity correlates positively with correctness, while across-trial instability correlates negatively and signals unreliable reasoning. Combining these structural signals with answer dispersion improves the identification of unreliable instances on logical and mathematical tasks.
Key Points
- Structural uncertainty assesses consistency in LLM reasoning beyond output dispersion.
- The framework aggregates the model's pairwise self-preferences into rankings over candidate solutions.
- Within-trial ambiguity correlates positively with correctness in reasoning tasks.
- Across-trial instability signals unreliable reasoning paths in LLM outputs.
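- Combined with answer dispersion, the approach improves identification of unreliable instances on logical and mathematical reasoning tasks; on factual retrieval the structural signal is uninformative.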
Article Content
From source RSS / original summary
arXiv:2606.17312v1 (Announce Type: new)
Abstract: Large language models can arrive at the same answer through reasoning paths that are unstable, contradictory, or difficult to rank consistently -- a failure mode especially prevalent in multi-step deductive reasoning. Existing methods assess reliability primarily through output dispersion -- measuring how much sampled answers differ -- but this discards a complementary signal: whether the model can consistently rank competing reasoning candidates.
We propose structural uncertainty, a consistency-aware framework derived from the stability of self-preference-induced rankings over sampled reasoning solutions. Given a query, we generate multiple candidate solutions and ask the model to judge pairwise preferences among its own outputs. We aggregate self-preferences into ranking distributions via Bradley-Terry modeling with PageRank, and decompose the signal into two entropy-based components: across-trial ranking instability and within-trial candidate ambiguity.
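The pipeline described above can be illustrated with a minimal sketch. It is not the authors' implementation, and several choices are assumptions made for illustration: the same candidate set is judged in every trial, Bradley-Terry strengths are fitted with the standard MM update plus a small pseudo-count, the PageRank step is omitted because the abstract does not say how it is combined with Bradley-Terry, and both entropy definitions (and the names `bradley_terry`, `entropy`, `structural_uncertainty`) are hypothetical.

```python
import math


def bradley_terry(wins, iters=200, prior=0.1):
    """Fit Bradley-Terry strengths with the standard MM update.

    wins[i][j] = number of times candidate i was preferred over candidate j.
    `prior` is a pseudo-count added to every ordered pair (an assumption of
    this sketch) so that a candidate with zero wins keeps a positive strength.
    """
    k = len(wins)
    w = [[wins[i][j] + (prior if i != j else 0.0) for j in range(k)]
         for i in range(k)]
    p = [1.0 / k] * k
    for _ in range(iters):
        new_p = []
        for i in range(k):
            total_wins = sum(w[i][j] for j in range(k) if j != i)
            denom = sum((w[i][j] + w[j][i]) / (p[i] + p[j])
                        for j in range(k) if j != i)
            new_p.append(total_wins / denom)
        norm = sum(new_p)
        p = [x / norm for x in new_p]  # strengths sum to 1
    return p


def entropy(probs):
    """Shannon entropy in nats, skipping zero-probability entries."""
    return -sum(q * math.log(q) for q in probs if q > 0)


def structural_uncertainty(trials):
    """Return (within_trial_ambiguity, across_trial_instability) in [0, 1].

    trials: list of k x k win matrices, one per judging trial.
    Within-trial: mean normalized entropy of the fitted strengths (assumed).
    Across-trial: normalized entropy of which candidate ranks first (assumed).
    """
    k = len(trials[0])
    strengths = [bradley_terry(w) for w in trials]
    within = sum(entropy(p) for p in strengths) / len(strengths) / math.log(k)
    top_counts = [0] * k
    for p in strengths:
        top_counts[p.index(max(p))] += 1
    across = entropy([c / len(trials) for c in top_counts]) / math.log(k)
    return within, across


# Three trials that all agree on the ordering 0 > 1 > 2.
stable = [[[0, 3, 3], [0, 0, 3], [0, 0, 0]]] * 3
# Three trials that each crown a different candidate.
unstable = [
    [[0, 3, 3], [0, 0, 3], [0, 0, 0]],
    [[0, 0, 3], [3, 0, 3], [0, 0, 0]],
    [[0, 0, 0], [3, 0, 0], [3, 3, 0]],
]
print(structural_uncertainty(stable))
print(structural_uncertainty(unstable))
```

In this toy input the stable trials always rank the same candidate first, so the across-trial term is zero, while the unstable trials spread the top rank evenly and push it to its maximum.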
Across five LLMs and eight benchmarks, structural signals provide information complementary to answer dispersion: on logical and mathematical reasoning tasks, the combination improves identification of unreliable instances, while on factual retrieval the structural signal collapses toward uniformity, diagnosing a regime boundary where reasoning-level consistency evaluation is uninformative.
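For contrast, the answer-dispersion baseline mentioned above can be sketched as the normalized entropy of the sampled final answers. This is one common way to measure how much sampled answers differ, not necessarily the measure used in the paper; the function name `answer_dispersion` and the normalization by the log of the sample count are assumptions. The abstract does not say how dispersion and the structural signal are combined, so that step is not shown.

```python
import math
from collections import Counter


def answer_dispersion(answers):
    """Normalized Shannon entropy of sampled final answers.

    Returns 0.0 when every sample agrees and 1.0 when every sample differs.
    Normalizing by log(number of samples) is an assumption of this sketch.
    """
    n = len(answers)
    if n <= 1:
        return 0.0
    counts = Counter(answers)
    h = -sum((c / n) * math.log(c / n) for c in counts.values())
    return h / math.log(n)


print(answer_dispersion(["42", "42", "42", "42"]))  # all samples agree
print(answer_dispersion(["42", "42", "41", "40"]))  # partial disagreement
print(answer_dispersion(["42", "41", "40", "39"]))  # all samples differ
```

A score like this looks only at final answers, which is why two queries with identical dispersion can still differ in how consistently the model ranks the reasoning paths behind them.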
The two components relate differently to accuracy: within-trial ambiguity correlates positively with correctness -- consistent with settings where multiple plausible solution paths remain competitive -- while across-trial instability correlates negatively, signaling unreliable reasoning. Structural uncertainty is best understood not as a universal confidence estimator, but as a regime-sensitive evaluator of logical reasoning consistency.