Revisiting Chain-of-Thought Reasoning under Limited Supervision: Semi-supervised Chain-of-Thought Learning
Quick Answer
The paper introduces Semi-CoT, a semi-supervised learning framework leveraging unlabeled questions to generate pseudo reasoning chains for large language models.
Quick Take
Semi-CoT samples multiple pseudo-CoTs for each unlabeled question and keeps the low-entropy ones as demonstrations. In pilot experiments on AQuA, SVAMP, GSM8K, and MultiArith, the selected pseudo-answers reach 91.36% to 100% precision. Downstream gains are small on SVAMP and GSM8K and absent on AQuA and MultiArith, so demonstration selection remains an open problem.
Key Points
- Semi-CoT constructs pseudo reasoning supervision from unlabeled questions.
- Pilot experiments show pseudo-answer precision ranging from 91.36% to 100%.
- The entropy gate keeps only low-entropy reasoning chains, which yields high-precision pseudo-CoTs.
- Semi-CoT gives small gains on SVAMP and GSM8K, but effective use still requires stronger demonstration selection or student training.
- Negative transfer observed on AQuA; MultiArith performance reached a ceiling.
Article Content
From source RSS / original summary: arXiv:2607.01511v1
Abstract: Chain-of-thought (CoT) reasoning has emerged as an effective approach for activating latent reasoning capabilities in large language models. However, most existing CoT methods use reasoning chains mainly as inference-time prompts, while the generated reasoning traces are rarely reused as semi-supervised learning signals.
In this report, we define Semi-supervised Chain-of-Thought Learning and propose Semi-CoT, a simple framework that uses unlabeled questions to construct pseudo reasoning supervision. Semi-CoT samples multiple pseudo-CoTs for each unlabeled question, estimates answer-level semantic entropy, and selects low-entropy reasoning chains as reliable pseudo-CoT demonstrations. This extends the self-training view of CoT from inference-time refinement to semi-supervised pseudo-supervision.
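The sample, score, and select loop described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the abstract does not say how answers are grouped into semantic classes, what threshold is used, or which chain is kept. Here answers are grouped by exact match on an already-normalized string, and the names `answer_entropy`, `entropy_gate`, and `threshold` are hypothetical.

```python
import math
from collections import Counter


def answer_entropy(answers):
    """Shannon entropy (in nats) of the empirical distribution of final answers.

    Assumption: two answers belong to the same class if their strings match
    exactly. The paper's "semantic" grouping may be more lenient.
    """
    counts = Counter(answers)
    total = len(answers)
    return sum((c / total) * math.log(total / c) for c in counts.values())


def entropy_gate(samples, threshold=0.5):
    """Select one pseudo-CoT for a question, or None if the samples disagree.

    samples: list of (reasoning_chain, final_answer) pairs sampled for one
             unlabeled question.
    threshold: hypothetical entropy cutoff; not taken from the paper.
    """
    answers = [answer for _, answer in samples]
    if not answers or answer_entropy(answers) > threshold:
        return None
    # Assumption: keep the first chain that reaches the majority answer.
    majority_answer, _ = Counter(answers).most_common(1)[0]
    for chain, answer in samples:
        if answer == majority_answer:
            return chain, answer
    return None
```

A question whose sampled chains all agree has entropy 0 and always passes the gate. A question whose samples split evenly between two answers has entropy ln 2 (about 0.69) and is rejected at the illustrative cutoff of 0.5.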
Pilot experiments on AQuA, SVAMP, GSM8K, and MultiArith show that the entropy gate selects high-precision pseudo-CoTs, with pseudo-answer precision ranging from 91.36% to 100%. Semi-CoT also gives small gains on SVAMP and GSM8K, while AQuA shows negative transfer and MultiArith reaches a ceiling. These results suggest that unlabeled questions can provide reliable pseudo reasoning signals, but their effective use still requires stronger demonstration selection or student training.
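For readers unfamiliar with the metric, pseudo-answer precision can be read as the fraction of gate-selected pseudo-answers that match the held-out gold answers. The abstract does not give the exact definition, so the sketch below is an assumption, and the function name and toy data are hypothetical rather than taken from the paper.

```python
def pseudo_answer_precision(selected, gold):
    """Fraction of selected pseudo-answers that equal the gold answer.

    selected: dict mapping question id -> pseudo-answer, containing only the
              questions that passed the entropy gate.
    gold: dict mapping question id -> gold answer, used for evaluation only.
    """
    if not selected:
        raise ValueError("no pseudo-answers were selected")
    correct = sum(1 for qid, answer in selected.items() if gold[qid] == answer)
    return correct / len(selected)


# Toy data for illustration only; these are not results from the paper.
toy_selected = {"q1": "18", "q2": "5", "q3": "42", "q4": "7"}
toy_gold = {"q1": "18", "q2": "5", "q3": "40", "q4": "7", "q5": "3"}
toy_precision = pseudo_answer_precision(toy_selected, toy_gold)
```

Questions rejected by the gate (such as `q5` in the toy data) do not enter the denominator, so a strict gate can report high precision while covering few questions.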