Diagnosing Multi-step Reasoning Failures in Black-box LLMs via Stepwise Confidence Attribution

arXiv cs.CL·Xiaoou Liu, Tiejin Chen, Dengjia Zhang, Yaqing Wang, Lu Cheng, Hua Wei

5/20/2026


Quick Answer

The paper introduces Stepwise Confidence Attribution (SCA), a framework for diagnosing multi-step reasoning failures in black-box LLMs without internal access.

Quick Take

SCA applies the Information Bottleneck principle to assign step-level confidence from generated reasoning traces alone. Using that confidence to guide self-correction improves the correction success rate by up to 13.5% over answer-level feedback. Experiments on mathematical reasoning and multi-hop question answering show that low-confidence steps are strongly correlated with reasoning errors.
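For reference, the Information Bottleneck principle in its standard textbook form seeks a compressed representation Z of an input X that remains predictive of a target Y. The generic objective is written out below. How SCA maps reasoning steps and solution correctness onto these variables is not stated in this summary, so X, Z, Y, and the trade-off weight β are generic symbols here, not the paper's notation.

```latex
\min_{p(z \mid x)} \; I(X; Z) \;-\; \beta \, I(Z; Y), \qquad \beta > 0
```

The first term penalizes information retained about the input (compression), while the second rewards information retained about the target (relevance); β sets the balance between them.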

Key Points

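SCA assigns step-level confidence using only generated reasoning traces, so it needs no internal model access.
Two complementary methods: NIBS, a non-parametric IB approach that measures consistency without graph structures, and GIBS, a graph-based IB model that learns subgraphs through a differentiable mask.
Low-confidence steps are strongly correlated with reasoning errors on mathematical reasoning and multi-hop question answering.
Step-level feedback improves the self-correction success rate by up to 13.5% over answer-level feedback.
Applies to closed-source LLMs, where only generated outputs are available.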
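The idea of scoring a step by how well it agrees with other solutions can be illustrated with a small sketch. This is not NIBS or GIBS: it is a simplified stand-in that assumes a handful of reference traces are available and uses token-level Jaccard overlap as the similarity measure. The function names (`tokens`, `jaccard`, `step_confidence`) and the similarity choice are hypothetical and chosen only for illustration.

```python
from __future__ import annotations

import re


def tokens(step: str) -> set[str]:
    """Return the lowercased alphanumeric tokens of one reasoning step."""
    return set(re.findall(r"[a-z0-9]+", step.lower()))


def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard overlap of two token sets (1.0 when both are empty)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def step_confidence(trace: list[str], references: list[list[str]]) -> list[float]:
    """Score each step by its best match in every reference trace, averaged.

    Illustrative assumption: a step that closely matches some step in most
    reference traces is treated as high confidence; a step that matches
    nothing well is treated as low confidence.
    """
    scores = []
    for step in trace:
        step_tokens = tokens(step)
        best_per_reference = [
            max((jaccard(step_tokens, tokens(ref_step)) for ref_step in ref), default=0.0)
            for ref in references
        ]
        if best_per_reference:
            scores.append(sum(best_per_reference) / len(best_per_reference))
        else:
            scores.append(0.0)
    return scores


# Toy example: the second step of the trace contains an arithmetic slip.
references = [
    ["2 + 3 = 5", "5 * 4 = 20"],
    ["2 + 3 = 5", "5 * 4 = 20"],
]
trace = ["2 + 3 = 5", "5 * 4 = 25"]
scores = step_confidence(trace, references)
lowest = min(range(len(scores)), key=scores.__getitem__)
print(scores, "lowest-confidence step index:", lowest)
```

In this toy case the erroneous second step scores lower than the first because its tokens diverge from the reference traces. A surface-overlap measure like this would miss errors that reuse the same tokens, which is one reason a learned or structure-aware approach may be preferable.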


[Submitted on 19 May 2026]


Abstract: Large Language Models have achieved strong performance on reasoning tasks with objective answers by generating step-by-step solutions, but diagnosing where a multi-step reasoning trace might fail remains difficult. Confidence estimation offers a diagnostic signal, yet existing methods are restricted to final answers or require internal model access. In this paper, we introduce Stepwise Confidence Attribution (SCA), a framework for closed-source LLMs that assigns step-level confidence based only on generated reasoning traces. SCA applies the Information Bottleneck principle: steps aligning with consensus structures across correct solutions receive high confidence, while deviations are flagged as potentially erroneous. We propose two complementary methods: (1) NIBS, a non-parametric IB approach measuring consistency without graph structures, and (2) GIBS, a graph-based IB model that learns subgraphs through a differentiable mask to capture logical variability. Extensive experiments on mathematical reasoning and multi-hop question answering show that SCA reliably identifies low-confidence steps strongly correlated with reasoning errors. Moreover, using step-level confidence to guide self-correction improves the correction success rate by up to 13.5% over answer-level feedback.
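To make the contrast between step-level and answer-level feedback concrete, the sketch below turns per-step confidence scores into a correction prompt that names the specific steps to re-examine. The prompt wording, the `build_feedback` function, and the 0.6 threshold are hypothetical; the paper's actual self-correction protocol is not described in the abstract.

```python
def build_feedback(steps, confidences, threshold=0.6):
    """Build a correction prompt that points at low-confidence steps.

    `threshold` is an illustrative cutoff, not a value from the paper.
    """
    if len(steps) != len(confidences):
        raise ValueError("steps and confidences must have the same length")

    # Collect (1-based index, step text, confidence) for flagged steps.
    flagged = [
        (index, step, conf)
        for index, (step, conf) in enumerate(zip(steps, confidences), start=1)
        if conf < threshold
    ]
    if not flagged:
        # Falls back to answer-level feedback when nothing is flagged.
        return "No step fell below the confidence threshold; re-check the final answer."

    lines = ["Re-examine the following low-confidence steps and revise the solution:"]
    lines += [f"- Step {index} (confidence {conf:.2f}): {step}" for index, step, conf in flagged]
    return "\n".join(lines)


print(build_feedback(["2 + 3 = 5", "5 * 4 = 25"], [1.0, 0.5]))
```

An answer-level signal would only say that the final result is doubtful; the step-level message narrows the revision to the flagged step.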

Comments:	Accepted by ICML 2026
Subjects:	Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Theory (cs.IT); Machine Learning (cs.LG)
MSC classes:	68T50, 68T37, 68Q32
ACM classes:	I.2.7; I.2.6; I.2.4
Cite as:	arXiv:2605.19228 [cs.CL]
	(or arXiv:2605.19228v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2605.19228 arXiv-issued DOI via DataCite (pending registration)

Submission history

From: Hua Wei
[v1] Tue, 19 May 2026 00:57:51 UTC (520 KB)

— Originally published at arxiv.org
