BV-Blend: Uncertainty-Weighted Historical Baselines for Stable Critic-Free RL with Verifiable Rewards

arXiv cs.AI·Yupeng Chang, Yuan Wu, Yi Chang

6/30/2026

Quick Take

BV-Blend is a critic-free reinforcement learning framework that stabilizes advantage estimation by blending prompt-local reward statistics with historical moments tracked per semantic cluster. It targets a failure mode of Group Relative Policy Optimization (GRPO): when every rollout in a prompt group receives the same reward, group normalization yields zero advantages for that group, which impedes learning in cold-start regimes with binary verifiers. The authors report improved training stability and performance on verifiable reasoning benchmarks.

Key Points

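BV-Blend combines prompt-local on-policy statistics with semantic-cluster-conditioned historical moments for stable advantage estimation.
The framework addresses a GRPO weakness: identical rewards within a prompt group give zero within-group variance and therefore zero advantages.
Experiments show improved training stability and performance on verifiable reasoning benchmarks.
BV-Blend keeps EMA-tracked reward moments for each cluster and derives a confidence weight from a standard error of the mean (SEM) proxy; the weight blends historical and prompt-local baseline and variance statistics into a standardized advantage.
It remains robust in regimes where group-normalized methods may stall.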
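
The zero-advantage case in the second point can be reproduced in a few lines. The sketch below is a minimal illustration of GRPO-style group normalization, not code from the paper; the function name, the use of the population standard deviation, and the `eps` stabilizer are assumptions made for the example.

```python
from statistics import mean, pstdev


def group_normalized_advantages(rewards, eps=1e-6):
    """Standardize rewards within a single prompt group (GRPO-style).

    Assumption: population standard deviation plus a small eps in the
    denominator; real implementations may differ in both choices.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# A group with mixed binary rewards gives a usable learning signal.
mixed = group_normalized_advantages([1.0, 0.0, 0.0, 1.0])

# A cold-start group where the verifier rejects every rollout:
# every reward equals the group mean, so every advantage is zero.
all_fail = group_normalized_advantages([0.0, 0.0, 0.0, 0.0])

print(mixed)
print(all_fail)
```

In the all-fail group the numerator is zero for every rollout, so the policy gradient contribution from that prompt vanishes regardless of `eps`.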

[Submitted on 27 Jun 2026]

Abstract: Critic-free reinforcement learning with verifiable rewards (RLVR), exemplified by Group Relative Policy Optimization (GRPO), avoids training a value function (critic) and reduces memory and compute overhead relative to critic-based PPO pipelines for aligning large language models. However, GRPO-style advantage estimation depends on prompt-local (within-prompt-group) reward statistics and can be unstable. In particular, when all rollouts in a prompt group receive identical rewards, the within-group reward variance becomes zero, and group normalization yields zero advantages for that group, impeding learning in cold-start regimes with binary verifiers. We introduce BV-Blend, a critic-free framework that stabilizes advantage estimation by combining prompt-local on-policy statistics with semantic-cluster-conditioned historical moments. BV-Blend maintains EMA-tracked reward moments for each cluster, derives a confidence weight from a standard error of the mean (SEM) proxy, and uses this weight to blend historical and prompt-local baseline and variance statistics into a standardized advantage for PPO-style clipped updates. Experiments on verifiable reasoning benchmarks show that BV-Blend improves training stability and performance, and remains robust in regimes where group-normalized methods may stall.
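
The abstract names the ingredients (EMA-tracked cluster moments, an SEM-derived confidence weight, blended baseline and variance) but not the exact formulas. The sketch below is one hypothetical way to wire those ingredients together, offered only to make the idea concrete. Everything specific in it is an assumption: the `ClusterStats` class, the `decay` value, the weight mapping `1 / (1 + sem / tau)`, the `tau` scale, and the linear blend are illustrative choices, not the authors' method.

```python
import math
from dataclasses import dataclass


@dataclass
class ClusterStats:
    """Hypothetical EMA tracker of reward moments for one semantic cluster."""
    mean: float = 0.0
    second_moment: float = 0.0
    count: int = 0

    def update(self, rewards, decay=0.9):
        batch_mean = sum(rewards) / len(rewards)
        batch_m2 = sum(r * r for r in rewards) / len(rewards)
        if self.count == 0:
            # First observation initializes the moments directly.
            self.mean, self.second_moment = batch_mean, batch_m2
        else:
            self.mean = decay * self.mean + (1 - decay) * batch_mean
            self.second_moment = decay * self.second_moment + (1 - decay) * batch_m2
        self.count += len(rewards)

    @property
    def variance(self):
        return max(self.second_moment - self.mean ** 2, 0.0)


def blended_advantages(rewards, stats, tau=0.1, eps=1e-6):
    """Blend historical and prompt-local statistics into a standardized advantage.

    Assumed weight: w = 1 / (1 + SEM / tau), so a small standard error
    (a well-observed cluster) moves w toward 1 and favors history.
    """
    n = len(rewards)
    local_mean = sum(rewards) / n
    local_var = sum((r - local_mean) ** 2 for r in rewards) / n
    if stats.count == 0:
        w = 0.0  # no history yet: fall back to prompt-local statistics
    else:
        sem = math.sqrt(stats.variance / stats.count)
        w = 1.0 / (1.0 + sem / tau)
    baseline = w * stats.mean + (1 - w) * local_mean
    variance = w * stats.variance + (1 - w) * local_var
    return [(r - baseline) / (math.sqrt(variance) + eps) for r in rewards]


# Build some history for a cluster, then score an all-fail group.
history = ClusterStats()
history.update([1.0, 0.0, 0.0, 0.0])
history.update([0.0, 1.0, 0.0, 0.0])

with_history = blended_advantages([0.0, 0.0, 0.0, 0.0], history)
without_history = blended_advantages([0.0, 0.0, 0.0, 0.0], ClusterStats())

print(with_history)
print(without_history)
```

Under these assumptions, an all-fail group scored against a cluster that has seen occasional successes receives nonzero (negative) advantages, whereas with no history the result collapses to the prompt-local case and is zero. Whether the paper's blend behaves the same way in detail cannot be determined from the abstract alone.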

Subjects:	Artificial Intelligence (cs.AI)
Cite as:	arXiv:2606.28707 [cs.AI]
	(or arXiv:2606.28707v1 [cs.AI] for this version)
	https://doi.org/10.48550/arXiv.2606.28707 arXiv-issued DOI via DataCite

Submission history

From: Yupeng Chang
[v1] Sat, 27 Jun 2026 03:25:53 UTC (1,292 KB)

— Originally published at arxiv.org
