BV-Blend: Uncertainty-Weighted Historical Baselines for Stable Critic-Free RL with Verifiable Rewards
Quick Answer
BV-Blend introduces a critic-free reinforcement learning framework that stabilizes advantage estimation by blending prompt-local statistics with historical moments, enhancing training stability and performance in cold-start scenarios.
Quick Take
BV-Blend targets a failure mode of Group Relative Policy Optimization (GRPO): when every rollout in a prompt group receives the same reward, the within-group variance is zero and group normalization yields zero advantages, so that group contributes no learning signal. By blending prompt-local statistics with historical reward moments, BV-Blend keeps training stable on verifiable reasoning benchmarks, including cold-start regimes with binary verifiers.
Key Points
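- BV-Blend combines prompt-local on-policy statistics with semantic-cluster-conditioned historical moments for stable advantage estimation.
- It improves on GRPO, whose group-normalized advantages collapse to zero when all rollouts in a prompt group receive identical rewards.
- Experiments on verifiable reasoning benchmarks show improved training stability and performance.
- It tracks per-cluster reward moments with an exponential moving average (EMA) and derives a confidence weight from a standard error of the mean (SEM) proxy; the weight blends historical and prompt-local baseline and variance into a standardized advantage.
- It remains robust in regimes where group-normalized methods may stall.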
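The zero-advantage case mentioned in the key points can be illustrated with a minimal sketch of group-normalized advantages. The function name, the use of the population standard deviation, and the `eps` stabilizer are assumptions made for illustration, not details taken from the paper.

```python
import statistics


def group_normalized_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: (reward - group mean) / (group std + eps).

    The population standard deviation and eps are illustrative choices.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# A group with mixed binary rewards gets a usable signal.
mixed = group_normalized_advantages([1.0, 0.0, 0.0, 1.0])

# A group where every rollout fails has zero variance, so every
# advantage is exactly zero and the group contributes no gradient.
all_fail = group_normalized_advantages([0.0, 0.0, 0.0, 0.0])
```

With a binary verifier early in training, many prompt groups can look like `all_fail`, which is the cold-start stall the summary refers to.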
Abstract: Critic-free reinforcement learning with verifiable rewards (RLVR), exemplified by Group Relative Policy Optimization (GRPO), avoids training a value function (critic) and reduces memory and compute overhead relative to critic-based PPO pipelines for aligning large language models. However, GRPO-style advantage estimation depends on prompt-local (within-prompt-group) reward statistics and can be unstable. In particular, when all rollouts in a prompt group receive identical rewards, the within-group reward variance becomes zero, and group normalization yields zero advantages for that group, impeding learning in cold-start regimes with binary verifiers. We introduce BV-Blend, a critic-free framework that stabilizes advantage estimation by combining prompt-local on-policy statistics with semantic-cluster-conditioned historical moments. BV-Blend maintains EMA-tracked reward moments for each cluster, derives a confidence weight from a standard error of the mean (SEM) proxy, and uses this weight to blend historical and prompt-local baseline and variance statistics into a standardized advantage for PPO-style clipped updates. Experiments on verifiable reasoning benchmarks show that BV-Blend improves training stability and performance, and remains robust in regimes where group-normalized methods may stall.
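The blending step described in the abstract might look roughly like the sketch below. The abstract does not give the formulas, so everything concrete here is an assumption: the `ClusterMoments` class, the EMA `decay`, the mapping from SEM to a weight (`scale / (scale + sem)`), the `var_floor` guard, and the convex combination of means and variances are hypothetical stand-ins rather than the authors' method. Cluster assignment and the PPO-style clipped update are not shown.

```python
import math
from dataclasses import dataclass


@dataclass
class ClusterMoments:
    """EMA-tracked first and second reward moments for one cluster (hypothetical)."""
    mean: float = 0.0
    second_moment: float = 0.0
    count: int = 0

    def update(self, rewards, decay=0.9):
        for r in rewards:
            if self.count == 0:
                # Initialize the moments from the first observed reward.
                self.mean, self.second_moment = r, r * r
            else:
                self.mean = decay * self.mean + (1 - decay) * r
                self.second_moment = decay * self.second_moment + (1 - decay) * r * r
            self.count += 1

    @property
    def variance(self):
        return max(self.second_moment - self.mean ** 2, 0.0)


def confidence_weight(moments, scale=0.1, var_floor=0.05):
    """Map an SEM proxy to a weight in [0, 1]; this mapping is an assumption."""
    if moments.count == 0:
        return 0.0  # no history: rely on prompt-local statistics only
    # var_floor avoids over-confidence when few rewards have been seen.
    sem = math.sqrt(max(moments.variance, var_floor) / moments.count)
    return scale / (scale + sem)


def blended_advantages(rewards, moments, eps=1e-6):
    """Standardize rewards with blended historical and prompt-local statistics."""
    n = len(rewards)
    local_mean = sum(rewards) / n
    local_var = sum((r - local_mean) ** 2 for r in rewards) / n
    w = confidence_weight(moments)
    baseline = w * moments.mean + (1 - w) * local_mean
    variance = w * moments.variance + (1 - w) * local_var
    return [(r - baseline) / (math.sqrt(variance) + eps) for r in rewards]


# Hypothetical history for one cluster, followed by an all-fail prompt group.
history = ClusterMoments()
history.update([1.0, 0.0, 0.0, 1.0, 0.0, 1.0, 0.0, 0.0])
adv = blended_advantages([0.0, 0.0, 0.0, 0.0], history)

# With no history the weight is 0 and the sketch reduces to group normalization.
adv_no_history = blended_advantages([0.0, 0.0, 0.0, 0.0], ClusterMoments())
```

In this sketch the all-fail group receives non-zero advantages because the historical baseline sits above the group's rewards, while the empty-history case falls back to the zero advantages of plain group normalization.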
Subjects: Artificial Intelligence (cs.AI)
Cite as: arXiv:2606.28707 [cs.AI] (or arXiv:2606.28707v1 [cs.AI] for this version)
DOI: https://doi.org/10.48550/arXiv.2606.28707 (arXiv-issued DOI via DataCite)
Submission history
From: Yupeng Chang
[v1] Sat, 27 Jun 2026 03:25:53 UTC (1,292 KB)
Originally published at arxiv.org