CoRA: Confidence-Rationale Alignment for Reliable Chain-of-Thought Reasoning
Quick Answer
The CoRA framework enhances chain-of-thought reasoning in LLMs by aligning confidence with rationale support, reducing alignment errors by up to 26.51% across MedQA, MathQA, and OpenBookQA benchmarks.
Quick Take
CoRA trains LLMs with a GRPO-based reinforcement learning objective that jointly rewards answer correctness, committed-answer probability, and rubric-scored rationale support, so that a confident answer is backed by a rationale that substantively supports it. In the reported experiments on three open-weight LLMs, accuracy stays competitive and calibration often improves.
Key Points
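- CoRA reduces confidence-rationale alignment error by up to 26.51% compared with untuned checkpoints, SFT, and correctness-only GRPO.
- Training uses a GRPO-based reinforcement learning framework that jointly rewards answer correctness, committed-answer probability, and rubric-based rationale support.
- Accuracy remains competitive, and calibration often improves.
- Evaluated on the MedQA, MathQA, and OpenBookQA benchmarks using three open-weight LLMs.
- Argues that reliable CoT reasoning needs rationales that substantively support confident answers, not just confident answers.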
Article Excerpt
From the source RSS summary (arXiv:2606.14961v1).
Abstract: Chain-of-thought (CoT) reasoning can improve LLM performance, but high answer confidence may be misleading when the accompanying CoT rationale is plausible yet incomplete or poorly supported. We study confidence-rationale alignment: whether a model's confidence in its committed answer is justified by its generated rationale. We introduce a GRPO-based reinforcement learning framework that jointly rewards answer correctness, committed-answer probability, and rubric-based rationale support, where the rubric assesses grounding, coherence, task match, and connection to the selected answer without revealing the gold answer to the judge. Across MedQA, MathQA, and OpenBookQA using three open-weight LLMs, our method reduces the confidence-rationale alignment error by up to 26.51% compared with untuned checkpoints, SFT, and correctness-only GRPO, while maintaining competitive accuracy and often improving calibration. These results show that reliable CoT reasoning requires not only confident answers, but rationales that substantively support them.
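The abstract names three reward signals but does not say how they are combined, so the following is only a minimal sketch of one plausible reading. The weighted-sum form, the weights, the averaging of rubric criteria, and all names (`Rollout`, `joint_reward`, `group_relative_advantages`) are assumptions made for illustration, not the paper's implementation. The group-relative standardization is the usual GRPO idea of comparing each sampled response with the others drawn for the same prompt.

```python
from dataclasses import dataclass
from statistics import mean, pstdev

# The four rubric criteria named in the abstract; the key names are illustrative.
RUBRIC = ("grounding", "coherence", "task_match", "answer_connection")


@dataclass
class Rollout:
    correct: bool                    # committed answer matches the gold answer
    answer_prob: float               # model probability of the committed answer, in [0, 1]
    rubric_scores: dict[str, float]  # judge score per criterion, each in [0, 1]


def rationale_support(scores: dict[str, float]) -> float:
    """Average the rubric criteria into one support score (assumed aggregation)."""
    return mean(scores[c] for c in RUBRIC)


def joint_reward(r: Rollout, w_correct: float = 1.0,
                 w_prob: float = 0.5, w_support: float = 0.5) -> float:
    """Weighted sum of the three signals; the weights and sum form are assumptions."""
    return (w_correct * float(r.correct)
            + w_prob * r.answer_prob
            + w_support * rationale_support(r.rubric_scores))


def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """GRPO-style advantages: standardize rewards within one prompt's sample group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; eps guards the all-equal case
    return [(x - mu) / (sigma + eps) for x in rewards]


# Three hypothetical samples for one prompt, all equally confident (0.9).
group = [
    Rollout(True, 0.9, dict.fromkeys(RUBRIC, 0.8)),   # correct, well supported
    Rollout(True, 0.9, dict.fromkeys(RUBRIC, 0.2)),   # correct, weakly supported
    Rollout(False, 0.9, dict.fromkeys(RUBRIC, 0.2)),  # wrong, weakly supported
]
rewards = [joint_reward(r) for r in group]
advantages = group_relative_advantages(rewards)
```

With equal confidence across the three samples, the well-supported correct answer receives the largest advantage, which is the behavior a correctness-only reward cannot distinguish from the weakly supported correct answer. In the paper's setup the rubric scores would come from a judge that does not see the gold answer; here they are hard-coded placeholders.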