Revisiting Chain-of-Thought Reasoning under Limited Supervision: Semi-supervised Chain-of-Thought Learning

arXiv cs.AI·Hongyang He, Jiuming Liu, Victor Sanchez

3h ago

·~1 min·7/3/2026·en·0

Quick Answer

The paper introduces Semi-CoT, a semi-supervised learning framework leveraging unlabeled questions to generate pseudo reasoning chains for large language models.

Quick Take

The paper introduces Semi-CoT, a semi-supervised learning framework leveraging unlabeled questions to generate pseudo reasoning chains for large language models. Experiments on benchmarks like AQuA and GSM8K show pseudo-answer precision between 91.36% and 100%, indicating potential for effective reasoning signal generation, though challenges remain in demonstration selection.

Key Points

Semi-CoT constructs pseudo reasoning supervision from unlabeled questions.
Pilot experiments show pseudo-answer precision ranging from 91.36% to 100%.
Entropy gate effectively selects high-precision pseudo-CoTs.
Results indicate potential but highlight challenges in demonstration selection.
Negative transfer observed on AQuA; MultiArith performance reached a ceiling.

Paper Resources

Read Paperarxiv.org View PDFarxiv.org

Article Content

From source RSS / original summary

arXiv:2607. 01511v1 Announce Type: new Abstract: Chain-of-thought (CoT) reasoning has emerged as an effective approach for activating latent reasoning capabilities in large language models. However, most existing CoT methods use reasoning chains mainly as inference-time prompts, while the generated reasoning traces are rarely reused as semi-supervised learning signals.

In this report, we define \textbf{Semi-supervised Chain-of-Thought Learning} and propose \textbf{Semi-CoT}, a simple framework that uses unlabeled questions to construct pseudo reasoning supervision. Semi-CoT samples multiple pseudo-CoTs for each unlabeled question, estimates answer-level semantic entropy, and selects low-entropy reasoning chains as reliable pseudo-CoT demonstrations. This extends the self-training view of CoT from inference-time refinement to semi-supervised pseudo-supervision.

Pilot experiments on AQuA, SVAMP, GSM8K, and MultiArith show that the entropy gate selects high-precision pseudo-CoTs, with pseudo-answer precision ranging from $91. 36\%$ to $100\%$. Semi-CoT also gives small gains on SVAMP and GSM8K, while AQuA shows negative transfer and MultiArith reaches a ceiling. These results suggest that unlabeled questions can provide reliable pseudo reasoning signals, but their effective use still requires stronger demonstration selection or student training.

Read on arxiv.org

Want this in your inbox every morning?

Daily brief at your local 8am — bilingual EN/中文, free.

Subscribe — it's free

More from arXiv cs.AI

See more →

arXiv cs.AI·Ye Liu, Srijan Bansal, Bo Pang, Yang Li, Zeyu Leo Liu, Yifei Ming, Zixuan Ke, Shafiq Joty, Semih Yavuz

3h ago

FeaturedOriginal

Procedural Memory Distillation: Online Reflection for Self-Improving Language Models

AI Summary

Procedural Memory Distillation (PMD) enhances reinforcement learning by converting cross-episode signals into reusable memory, improving Qwen3-8B and OLMo3-Instruct-7B models by 3.8-5.5% on SCIKNOWEVAL and 7.9-13.6% on . The co-evolution of policy and memory allows for more effective self-supervision, demonstrating significant performance gains when both components are active.

#LLM #AI Coding #Inference #Policy