CAST: Non-Privileged Clipped Asymmetric Self-Teaching with Advantage Flipping for GRPO

arXiv cs.AI·Yang Li, Gongle Xue, Yijia Guo, Yuheng Yuan, Liwen Hu, Lei Ma

6/2/2026


Quick Answer

CAST introduces a novel answer-free self-distillation method for Group Relative Policy Optimization (GRPO) in reinforcement learning with verifiable rewards (RLVR).

Quick Take

CAST keeps the verifier-grounded GRPO objective and uses a stop-gradient self-teacher to shape token-level advantages according to trajectory correctness. The paper reports that this improves RLVR training on mathematical reasoning while keeping the training objective lightweight. Unlike earlier self-distilled RLVR methods, CAST needs no reference-solution-conditioned teacher scoring, and it assigns bounded, sign-constrained base advantages to zero-variance (all-correct or all-wrong) groups so that they still contribute a training signal.

Key Points

CAST keeps the verifier-grounded GRPO objective and uses a stop-gradient self-teacher to shape token-level advantages.
Advantage signs can flip locally in both directions: teacher-negative tokens in correct trajectories can receive negative advantages, and teacher-positive tokens in incorrect trajectories can receive bounded positive advantages.
Experiments on mathematical reasoning show improved RLVR training.
CAST does not require reference-solution-conditioned teacher scoring; its self-teacher is answer-free.
Zero-variance (all-correct or all-wrong) groups receive bounded, sign-constrained base advantages, so they contribute verifier-signed token feedback instead of a zero gradient.
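
To see why zero-variance groups matter, here is a minimal sketch of the standard group-relative advantage used in GRPO-style training (reward minus group mean, divided by group standard deviation). The function name `grpo_advantages` and the `eps` constant are illustrative assumptions, not taken from the paper.

```python
from statistics import mean, pstdev

def grpo_advantages(rewards, eps=1e-6):
    """Group-relative advantage: (r - group mean) / (group std + eps).

    `eps` is an assumed small constant that avoids division by zero.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Mixed group: correct rollouts get positive advantages, incorrect ones negative.
mixed = grpo_advantages([1.0, 0.0, 0.0, 1.0])

# Zero-variance groups: every advantage is exactly zero, so these prompts
# contribute no gradient under plain GRPO.
all_correct = grpo_advantages([1.0, 1.0, 1.0, 1.0])
all_wrong = grpo_advantages([0.0, 0.0, 0.0, 0.0])
```

In both zero-variance cases every reward equals the group mean, so the numerator is zero for each rollout. This is the situation that CAST's bounded, sign-constrained base advantages are meant to address.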



[Submitted on 29 May 2026]


Abstract: Reinforcement learning with verifiable rewards (RLVR), especially Group Relative Policy Optimization (GRPO), has been widely used to improve reasoning in large language models. However, outcome-level rewards provide only sparse supervision, and group-relative advantages vanish when all sampled trajectories for a prompt are either correct or incorrect. On-Policy Self-Distillation (OPSD) offers dense token-level guidance, but its token preferences are not necessarily aligned with trajectory correctness; empirical diagnostics show that OPSD signals behave differently on correct and incorrect rollouts, with teacher-positive and teacher-negative gap signals exhibiting different noise profiles. These diagnostics are conducted under an OPSD-style privileged teacher context for analysis only, whereas CAST training uses an answer-free self-teacher. Motivated by these observations, this work proposes CAST, an answer-free self-distillation method for GRPO-style RLVR. CAST keeps the verifier-grounded GRPO objective, but uses a stop-gradient self-teacher to shape token-level advantages according to trajectory correctness. Unlike prior self-distilled RLVR methods, CAST does not require reference-solution-conditioned teacher scoring, keeps the self-teacher log-probability gap active throughout training, and applies bidirectional local advantage sign reversal: teacher-negative tokens in correct trajectories can receive negative token-level advantages, while teacher-positive tokens in incorrect trajectories can receive bounded positive local advantages. For zero-variance all-correct and all-wrong groups, CAST assigns bounded sign-constrained base advantages, so these otherwise zero-gradient groups can contribute verifier-signed token feedback. Experiments on mathematical reasoning show that CAST improves RLVR training while retaining a lightweight, verifier-grounded trajectory-level objective.
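
The abstract does not give CAST's formulas, so the following is only an illustrative sketch of how a clipped self-teacher log-probability gap could shift a trajectory-level base advantage enough to flip its sign on individual tokens. The additive form, the names `shaped_token_advantages` and `zero_variance_base`, and the constants `beta`, `max_gap`, and `delta` are all assumptions made for illustration, not the authors' method. Teacher log-probabilities are passed in as plain numbers, mirroring the stop-gradient treatment.

```python
def clip(x, lo, hi):
    """Clamp x into the closed interval [lo, hi]."""
    return max(lo, min(hi, x))

def zero_variance_base(all_correct, delta=0.2):
    """Assumed bounded base advantage for a zero-variance group:
    +delta if every rollout is correct, -delta if every rollout is wrong."""
    return delta if all_correct else -delta

def shaped_token_advantages(base_adv, teacher_logps, policy_logps,
                            beta=0.5, max_gap=2.0):
    """Hypothetical additive shaping: base advantage plus a clipped,
    scaled teacher-minus-policy log-probability gap per token."""
    shaped = []
    for t_lp, p_lp in zip(teacher_logps, policy_logps):
        gap = clip(t_lp - p_lp, -max_gap, max_gap)  # bounded gap signal
        shaped.append(base_adv + beta * gap)
    return shaped

# Correct trajectory (positive base): the second token is strongly
# teacher-negative, so its token-level advantage becomes negative.
correct = shaped_token_advantages(0.5, [-0.1, -3.0], [-0.5, -0.5])

# Incorrect trajectory (negative base): the first token is strongly
# teacher-positive, so it receives a positive advantage, bounded by
# beta * max_gap - |base_adv|.
wrong = shaped_token_advantages(-0.5, [-0.2, -4.0], [-2.5, -1.0])
```

Under these assumed constants, clipping the gap caps how far any single token can move away from the verifier-signed base advantage, which is one simple way to keep a flipped advantage bounded.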

Comments:	10 pages
Subjects:	Artificial Intelligence (cs.AI)
Cite as:	arXiv:2606.00172 [cs.AI]
	(or arXiv:2606.00172v1 [cs.AI] for this version)
	https://doi.org/10.48550/arXiv.2606.00172 arXiv-issued DOI via DataCite

Submission history

From: Yang Li
[v1] Fri, 29 May 2026 13:21:30 UTC (446 KB)

— Originally published at arxiv.org
