CAST: Non-Privileged Clipped Asymmetric Self-Teaching with Advantage Flipping for GRPO
Quick Take
CAST is an answer-free self-distillation method for Group Relative Policy Optimization (GRPO) in reinforcement learning with verifiable rewards (RLVR). It uses a stop-gradient self-teacher to shape token-level advantages according to trajectory correctness, and the authors report improved RLVR training on mathematical reasoning while the trajectory-level objective stays lightweight and verifier-grounded. Unlike earlier self-distilled RLVR methods, it needs no reference-solution-conditioned teacher scoring, and it assigns bounded, sign-constrained advantages to zero-variance groups that would otherwise contribute no gradient.
Key Points
- CAST maintains the GRPO objective while using a stop-gradient self-teacher for token-level advantages.
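- Sign reversal is bidirectional: teacher-negative tokens in correct trajectories can receive negative advantages, and teacher-positive tokens in incorrect trajectories can receive bounded positive advantages.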
- Experiments demonstrate improved RLVR training for mathematical reasoning tasks.
- CAST does not require reference-solution-conditioned teacher scoring, simplifying the training process.
- Zero-variance groups (all rollouts correct or all incorrect) contribute verifier-signed token feedback through bounded, sign-constrained base advantages.
Article Content
From source RSS / original summary: arXiv:2606.00172v1. Abstract: Reinforcement learning with verifiable rewards (RLVR), especially Group Relative Policy Optimization (GRPO), has been widely used to improve reasoning in large language models. However, outcome-level rewards provide only sparse supervision, and group-relative advantages vanish when all sampled trajectories for a prompt are either correct or incorrect.
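The vanishing-advantage problem can be seen in a few lines. The sketch below assumes the common GRPO formulation in which each rollout's verifier reward is standardized against the mean and standard deviation of its group; the function name and the `eps` stabilizer are illustrative choices, not taken from the paper.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """Standardize verifier rewards within one prompt's group of rollouts.

    Assumption: the usual (r - mean) / (std + eps) form; eps is a
    hypothetical stabilizer that avoids division by zero.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Mixed group: correct rollouts get positive advantages, incorrect ones negative.
mixed = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# Zero-variance group: every reward equals the mean, so every advantage is zero
# and the group contributes no gradient.
all_correct = group_relative_advantages([1.0, 1.0, 1.0, 1.0])
all_wrong = group_relative_advantages([0.0, 0.0, 0.0, 0.0])
```

In the all-correct and all-wrong groups every numerator is zero, which is why such prompts provide no learning signal under a purely group-relative scheme.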
On-Policy Self-Distillation (OPSD) offers dense token-level guidance, but its token preferences are not necessarily aligned with trajectory correctness; empirical diagnostics show that OPSD signals behave differently on correct and incorrect rollouts, with teacher-positive and teacher-negative gap signals exhibiting different noise profiles. These diagnostics are conducted under an OPSD-style privileged teacher context for analysis only, whereas CAST training uses answer-free self-teacher scoring.
Motivated by these observations, this work proposes CAST, an answer-free self-distillation method for GRPO-style RLVR. CAST keeps the verifier-grounded GRPO objective, but uses a stop-gradient self-teacher to shape token-level advantages according to trajectory correctness.
Unlike prior self-distilled RLVR methods, CAST does not require reference-solution-conditioned teacher scoring, keeps the self-teacher log-probability gap active throughout training, and applies bidirectional local advantage sign reversal: teacher-negative tokens in correct trajectories can receive negative token-level advantages, while teacher-positive tokens in incorrect trajectories can receive bounded positive local advantages.
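The abstract does not give the exact shaping formula, so the following is only a sketch of what bounded, bidirectional sign reversal could look like. It assumes that a token is "teacher-positive" when the self-teacher's log-probability exceeds the policy's (a positive gap), and that the gap is added to the trajectory-level advantage after asymmetric clipping. The names `beta`, `flip_cap`, and `boost_cap` are hypothetical, and the teacher log-probabilities are taken as plain inputs because the summary does not say how the self-teacher is constructed.

```python
def shape_token_advantages(base_adv, teacher_logps, policy_logps,
                           beta=1.0, flip_cap=0.5, boost_cap=0.5):
    """Illustrative token-level shaping; NOT the paper's exact rule.

    base_adv      : trajectory-level advantage (sign follows the verifier)
    teacher_logps : per-token log-probs from the self-teacher, treated as
                    constants (the stop-gradient in a real training loop)
    policy_logps  : per-token log-probs from the current policy
    """
    shaped = []
    for t_lp, p_lp in zip(teacher_logps, policy_logps):
        bonus = beta * (t_lp - p_lp)  # self-teacher log-probability gap
        if base_adv >= 0:
            # Correct rollout: a token may flip negative, but no lower than -flip_cap.
            lo, hi = -(base_adv + flip_cap), boost_cap
        else:
            # Incorrect rollout: a token may flip positive, but no higher than +flip_cap.
            lo, hi = -boost_cap, -base_adv + flip_cap
        bonus = max(lo, min(hi, bonus))  # asymmetric clipping
        shaped.append(base_adv + bonus)
    return shaped

# Correct rollout: the second token is strongly teacher-negative and flips sign.
correct = shape_token_advantages(1.0, [-0.1, -3.0], [-0.2, -0.5])

# Incorrect rollout: a strongly teacher-positive token gets a bounded positive value.
incorrect = shape_token_advantages(-1.0, [-0.1], [-3.0])
```

Under these assumptions the clip bounds depend on the sign of the trajectory-level advantage, which is one way to read "clipped asymmetric": a flipped token can move past zero, but only by a fixed cap.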
For zero-variance all-correct and all-wrong groups, CAST assigns bounded sign-constrained base advantages, so these otherwise zero-gradient groups can contribute verifier-signed token feedback. Experiments on mathematical reasoning show that CAST improves RLVR training while retaining a lightweight, verifier-grounded trajectory-level objective.
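One way to picture sign-constrained base advantages for zero-variance groups is shown below. This is a sketch under stated assumptions: rewards are binary (1 for correct, 0 for incorrect), mixed groups fall back to standard group-relative standardization, and the fixed magnitude `zero_var_magnitude` is a hypothetical parameter that the abstract does not specify.

```python
from statistics import mean, pstdev

def base_advantages(rewards, zero_var_magnitude=0.1, eps=1e-6):
    """Trajectory-level advantages with a fallback for zero-variance groups.

    Assumptions: binary 0/1 verifier rewards; zero_var_magnitude and eps
    are illustrative values, not taken from the paper.
    """
    mu, sigma = mean(rewards), pstdev(rewards)
    if sigma > 0:
        # Mixed group: ordinary group-relative standardization.
        return [(r - mu) / (sigma + eps) for r in rewards]
    # Zero-variance group: small bounded value whose sign follows the verifier.
    sign = 1.0 if rewards[0] > 0 else -1.0
    return [sign * zero_var_magnitude] * len(rewards)

all_correct = base_advantages([1.0, 1.0, 1.0])  # small positive base advantage
all_wrong = base_advantages([0.0, 0.0])         # small negative base advantage
```

Because the fallback is nonzero, token-level shaping has something to act on in groups that would otherwise produce zero gradient, while the sign remains fixed by the verifier outcome.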