Distilling LLM Feedback for Lean Theorem Proving

arXiv cs.AI·Gaetan Narozniak, Gérard Biau, Rémi Munos, Ahmad Rammal, Pierre Marion

6/1/2026

Quick Answer

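The paper introduces Feedback Distillation, a post-training method that trains a model to match, at the token level, its own distribution conditioned on privileged feedback from a language model. The method is evaluated on Lean4 theorem proving.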

Quick Take

Feedback Distillation supplies token-level supervision from language-model feedback, targeting the sparse rewards, limited exploration, and mode collapse that affect GRPO. On Lean4 theorem proving, it keeps generated trajectories more diverse than GRPO, which yields higher policy entropy and better pass@k scaling. The authors present it as a promising direction for post-training on complex reasoning.

Key Points

Feedback Distillation provides token-level supervision using privileged feedback produced by a language model, and can inject external knowledge.
The method maintains greater trajectory diversity than GRPO, yielding higher policy entropy.
Initializing GRPO from a Feedback Distillation checkpoint outperforms either method alone.
Evaluation on Lean4 theorem proving shows better pass@k scaling than GRPO.
The method targets GRPO's sparse rewards, limited exploration, and mode collapse.

Article Excerpt

From source RSS / original summary

arXiv:2605.30861v1. Abstract: Post-training for reasoning models typically combines supervised fine-tuning with reinforcement learning from verifiable rewards, most commonly with GRPO. However, this algorithm suffers from sparse rewards, limited exploration, and mode collapse.

Building upon recent works on self-distillation, we propose Feedback Distillation, a training method where the model is trained to match, at the token level, its own distribution conditioned on privileged feedback produced by a language model. Feedback Distillation offers token-level supervision and can inject external knowledge.
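To make the token-level matching objective concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the paper's implementation: the abstract does not specify the divergence, its direction, or how positions are aggregated, so the choice of KL(teacher || student) averaged over tokens is an assumption, and the function names are hypothetical. The "teacher" logits stand for the model's next-token scores when the privileged feedback is in its context, and the "student" logits for the same model without that feedback.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary-sized list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def token_kl(teacher_logits, student_logits):
    """KL(teacher || student) at a single token position.

    Assumption: the feedback-conditioned distribution is the fixed target.
    """
    log_p = log_softmax(teacher_logits)
    log_q = log_softmax(student_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))

def feedback_distillation_loss(teacher_logits_seq, student_logits_seq):
    """Mean per-token KL over one trajectory (hypothetical aggregation)."""
    if len(teacher_logits_seq) != len(student_logits_seq):
        raise ValueError("teacher and student must cover the same token positions")
    per_token = [
        token_kl(t, s) for t, s in zip(teacher_logits_seq, student_logits_seq)
    ]
    return sum(per_token) / len(per_token)

# Toy trajectory: 2 token positions, vocabulary of 3 tokens.
teacher = [[2.0, 0.5, 0.1], [0.2, 1.5, 0.3]]  # with feedback in context
student = [[1.0, 1.0, 1.0], [0.2, 1.5, 0.3]]  # without feedback
loss = feedback_distillation_loss(teacher, student)
```

Every token position contributes its own term to the loss, which is what distinguishes this kind of supervision from a single sparse reward per trajectory. In the toy example, only the first position contributes, because the two distributions already agree at the second.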

Evaluating our method for Lean4 theorem-proving, we find that Feedback Distillation maintains greater diversity in generated trajectories than GRPO, yielding higher policy entropy and better pass@k scaling. The two methods are complementary: initializing GRPO from a Feedback Distillation checkpoint outperforms either method alone. All in all, our results suggest a promising avenue to improve post-training for complex reasoning.
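The two quantities mentioned above, pass@k and policy entropy, can be sketched as follows. This is a hedged illustration: the pass@k function uses the standard unbiased estimator 1 - C(n-c, k)/C(n, k) for n samples of which c are correct, and whether the paper computes pass@k exactly this way is an assumption. The entropy function is the Shannon entropy of a single next-token distribution. The function names are hypothetical.

```python
import math

def pass_at_k(n, c, k):
    """Unbiased estimate of pass@k from n sampled proofs, c of which verify.

    Equals 1 - C(n - c, k) / C(n, k): one minus the probability that a
    random size-k subset of the n samples contains no correct proof.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # math.comb returns 0 when k > n - c, so the estimate is 1.0 in that case.
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

# A peaked policy has lower entropy than a spread-out one.
peaked = token_entropy([0.97, 0.01, 0.01, 0.01])
spread = token_entropy([0.25, 0.25, 0.25, 0.25])
```

A policy that collapses onto a few proof attempts gains little from additional samples, whereas a higher-entropy policy keeps producing distinct attempts, which is why diversity and pass@k scaling are discussed together.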
