Cross-Entropy Games and Frost Training · DeepSignal

Cross-Entropy Games and Frost Training

arXiv cs.AI·Arthur Renard, Franck Gabriel, Valentin Hartmann, Clément Hongler


·~1 min·5/28/2026·en·0

Quick Take

Frost Training improves Monte Carlo-based policy optimization for Cross-Entropy Games, a large family of LLM-as-a-judge tasks, by exploiting the gradient of the reward function in embedding space. In GRPO training for maximum-likelihood infilling, the authors report that the method produces higher-scoring outputs, reaches higher maximum scores in a best-of-k setting, and does so faster.

Key Points

Frost Training improves Monte Carlo-based policy optimization for Cross-Entropy Games, a large family of LLM-as-a-judge tasks.
It exploits the gradient of the reward function in embedding space, the same signal used by the Greedy Coordinate Gradient (GCG) jailbreaking technique.
The authors report higher-scoring model outputs and faster training.
The trained model reaches higher maximum scores in a best-of-k setting.
The method is validated using GRPO training for maximum-likelihood infilling.

Article Excerpt

From source RSS / original summary

arXiv:2605.27701v1 Announce Type: new Abstract: We present Frost Training, a method for improving Monte Carlo-based policy optimization for a large family of LLM-as-a-judge tasks called Cross-Entropy Games. The key idea is to exploit the gradient of the reward function in embedding space. This signal is used in the Greedy Coordinate Gradient (GCG) jailbreaking technique; we demonstrate for the first time that it can also be used to boost model training.
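
To make the embedding-space gradient signal concrete, here is a minimal toy sketch of the GCG-style idea: use the gradient of a reward with respect to a token's embedding to shortlist promising substitutions, then re-score the shortlist exactly. The abstract does not say how Frost Training folds this signal into training, so this is not the authors' algorithm. The vocabulary, the `EMBED` table, the quadratic `reward`, and all function names are assumptions invented for illustration.

```python
# Toy sketch: the vocabulary, embeddings, and reward below are invented for illustration.
EMBED = {
    "a": [1.0, 0.0],
    "b": [0.0, 1.0],
    "c": [-0.5, 0.0],
    "d": [0.7, 0.7],
}


def reward(tokens, targets):
    """Hypothetical reward: negative squared distance of each token embedding to a per-position target."""
    return -sum(
        (x - t) ** 2
        for tok, tgt in zip(tokens, targets)
        for x, t in zip(EMBED[tok], tgt)
    )


def reward_grad(tokens, targets, position):
    """Analytic gradient of the toy reward with respect to the embedding at one position."""
    e, tgt = EMBED[tokens[position]], targets[position]
    return [-2.0 * (x - t) for x, t in zip(e, tgt)]


def rank_substitutions(tokens, targets, position):
    """Rank vocabulary tokens by the first-order estimate of the reward change from swapping them in."""
    grad = reward_grad(tokens, targets, position)
    current = EMBED[tokens[position]]
    scores = {
        tok: sum(g * (x - c) for g, x, c in zip(grad, e, current))
        for tok, e in EMBED.items()
    }
    return sorted(scores, key=scores.get, reverse=True)


def best_substitution(tokens, targets, position, k=2):
    """Shortlist the top-k tokens by gradient score, then pick the one with the best exact reward."""
    candidates = rank_substitutions(tokens, targets, position)[:k]

    def exact(tok):
        trial = list(tokens)
        trial[position] = tok
        return reward(trial, targets)

    return max(candidates, key=exact)


tokens = ["a", "c"]
targets = [[0.0, 1.0], [0.0, 1.0]]
ranking = rank_substitutions(tokens, targets, 0)
choice = best_substitution(tokens, targets, 0, k=2)
```

In this toy setup the gradient score ranks `c` above `d` at position 0, although the exact reward prefers `d`. The gradient is only a first-order hint, which is why the shortlist is re-scored with the true reward before a substitution is chosen.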

We validate our method using GRPO training for maximum-likelihood infilling. Frost Training improves the model's ability to generate high-scoring outputs, reaching higher maximum scores in a best-of-k setting, and does so at an increased speed.
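
For readers unfamiliar with the evaluation terms, the following sketch shows the group-relative advantage commonly used in GRPO and a best-of-k score. It is a generic illustration, not the paper's training code. The function names are hypothetical, and the use of the population standard deviation with a small `eps` is an assumption, since implementations differ on these details.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each sampled output's reward against its own group (mean and standard deviation)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # assumption: population std; some implementations use sample std
    return [(r - mean) / (std + eps) for r in rewards]


def best_of_k(scores, k):
    """Best-of-k: the maximum score among the first k sampled outputs."""
    if not 1 <= k <= len(scores):
        raise ValueError("k must be between 1 and the number of samples")
    return max(scores[:k])


# Rewards for one group of four sampled outputs (made-up numbers).
group_rewards = [1.0, 2.0, 3.0, 2.0]
advantages = group_relative_advantages(group_rewards)
```

Outputs that score above the group mean get a positive advantage and are reinforced; those below get a negative one. Best-of-k instead asks how good the single best sample is, which is the setting in which the abstract reports higher maximum scores.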

