Cross-Entropy Games and Frost Training
Quick Take
Frost Training improves Monte Carlo-based policy optimization for Cross-Entropy Games, a large family of LLM-as-a-judge tasks, by exploiting the gradient of the reward function in embedding space. Validated with GRPO training for maximum-likelihood infilling, it improves the model's ability to generate high-scoring outputs, reaches higher maximum scores in best-of-k settings, and does so at an increased speed.
Key Points
- Frost Training improves Monte Carlo-based policy optimization for LLM-as-a-judge tasks.
- Exploits the gradient of the reward function in embedding space, the same signal used by the Greedy Coordinate Gradient (GCG) jailbreaking technique.
- Improves the model's ability to generate high-scoring outputs, and does so faster.
- Achieves higher maximum scores in best-of-k scenarios.
- Validated through GRPO training for maximum-likelihood infilling.
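To make the best-of-k metric concrete, here is a minimal sketch. It assumes a hypothetical `sample_score` callable that stands in for "sample one output and score it with the judge"; the source does not describe its evaluation code, so the names and the toy Gaussian scorer are illustrative only.

```python
import random


def best_of_k(sample_score, k, rng):
    """Return the maximum score among k independently sampled outputs."""
    return max(sample_score(rng) for _ in range(k))


def toy_sample_score(rng):
    # Hypothetical stand-in for "sample an output, then score it with the judge".
    return rng.gauss(0.0, 1.0)


rng = random.Random(0)
# Best-of-k scores for a few values of k (each k uses fresh samples).
scores = {k: best_of_k(toy_sample_score, k, rng) for k in (1, 4, 16)}
```

A method that reaches higher maximum scores in this setting produces a better top sample for the same sampling budget k.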
Article Excerpt
From source RSS / original summary: arXiv:2605.27701v1. Abstract: We present Frost Training, a method for improving Monte Carlo-based policy optimization for a large family of LLM-as-a-judge tasks called Cross-Entropy Games. The key idea is to exploit the gradient of the reward function in embedding space. This signal is used in the Greedy Coordinate Gradient (GCG) jailbreaking technique; we demonstrate for the first time that it can also be used to boost model training.
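The embedding-space gradient signal can be illustrated with a GCG-style first-order ranking of token swaps. This is only a sketch of that signal under loud assumptions: the reward (`toy_reward`), its analytic gradient, the four-token vocabulary, and the helper `rank_swaps` are all hypothetical, and the excerpt does not say how Frost Training feeds the signal into policy optimization.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def toy_reward(x, target):
    # Hypothetical differentiable reward: negative squared distance to a target embedding.
    return -sum((xi - ti) ** 2 for xi, ti in zip(x, target))


def toy_reward_grad(x, target):
    # Analytic gradient of toy_reward with respect to the embedding x.
    return [-2.0 * (xi - ti) for xi, ti in zip(x, target)]


def rank_swaps(embeddings, current, grad):
    """Rank token ids by the first-order estimate of the reward gain from swapping them in.

    gain(v) = (E[v] - E[current]) . grad, the linearization of the reward around
    the current token's embedding.
    """
    cur = embeddings[current]
    gains = {
        tok: dot([e - c for e, c in zip(emb, cur)], grad)
        for tok, emb in enumerate(embeddings)
    }
    return sorted(gains, key=gains.get, reverse=True)


embeddings = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy 4-token vocabulary
target = [0.9, 0.8]  # hypothetical embedding the toy reward prefers
current = 0          # token id currently occupying the position
grad = toy_reward_grad(embeddings[current], target)
ranking = rank_swaps(embeddings, current, grad)
```

A single gradient evaluation ranks every candidate token at once, instead of scoring each swap with a separate reward call. In this toy case the first-order ranking matches the true reward ordering; for a real judge model the linearization is only approximate.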
We validate our method using GRPO training for maximum-likelihood infilling. Frost Training improves the model's ability to generate high-scoring outputs, reaching higher maximum scores in a best-of-k setting, and does so at an increased speed.
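For readers unfamiliar with GRPO, its core step is to score each sampled output relative to the other outputs sampled for the same prompt. The sketch below shows that group-relative normalization only. The function name, the use of the population standard deviation, and the `eps` term are assumptions for illustration; it is not the authors' training code.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of samples drawn for the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # assumption: population standard deviation
    # eps avoids division by zero when every sample gets the same reward.
    return [(r - mean) / (std + eps) for r in rewards]


rewards = [0.2, 0.5, 0.9, 0.4]  # hypothetical judge scores for four samples
advantages = group_relative_advantages(rewards)
```

Samples that score above the group mean receive positive advantages and are reinforced; samples below it are discouraged. This is the Monte Carlo signal that the abstract says Frost Training improves on.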