Hint-Guided Diversified Policy Optimization for LLM Reasoning
Quick Take
Hint-Guided Diversified Policy Optimization (HDPO) trains an LLM to list several candidate solution outlines as hints, select the most reliable one, and only then reason in full. Training has two stages: Cold Start for Structured Reasoning, followed by Hint-Guided Diversified Reinforcement Learning. The authors report that HDPO improves reasoning performance, the diversity of candidate solutions, and the model's ability to identify reliable ones.
Key Points
- HDPO has the model list candidate solution outlines as hints before selecting one.
- The approach has two stages: Cold Start for Structured Reasoning and Hint-Guided Diversified Reinforcement Learning.
- Experimental results show enhanced diversity in candidate solutions.
- Models using HDPO identify reliable solutions more effectively.
- This method contrasts with traditional RLVR's outcome-level focus.
Article Content
From source RSS / original summary
arXiv:2606.03021v1 Announce Type: new
Abstract: Recent developments in Large Language Models (LLMs) have showcased impressive reasoning capabilities, with Reinforcement Learning with Verifiable Rewards (RLVR) being a promising enhancement strategy. However, existing reward mechanisms are constrained to the outcome-level correctness and lack explicit signals to guide the model to consider diverse solutions.
In contrast, human problem solving typically involves evaluating multiple potential approaches and selecting the most reliable solution, a cognitive process that current RLVR frameworks do not explicitly incentivize. Inspired by this, we propose Hint-Guided Diversified Policy Optimization (HDPO), allowing the model to first list all potential candidate solution outlines as hints and then select the most reliable one for further reasoning.
HDPO comprises two stages, Cold Start for Structured Reasoning and Hint-Guided Diversified Reinforcement Learning, to incentivize the model to generate diverse and reliable solutions following the "propose-select-think" trajectory. Experimental results show that HDPO effectively boosts LLM reasoning and enhances the diversity of candidate solutions as well as the LLM's ability to identify reliable solutions.
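To make the "propose-select-think" trajectory concrete, here is a minimal Python sketch of how such an output could be parsed and how the diversity of its hints could be scored. Everything in it is an assumption for illustration: the abstract does not specify an output format or a diversity measure, so the `<propose>`, `<select>` and `<think>` tags, the `Trajectory` class, and the Jaccard-based `hint_diversity` function are hypothetical and should not be read as the authors' method.

```python
from __future__ import annotations

import re
from dataclasses import dataclass


@dataclass
class Trajectory:
    hints: list[str]   # candidate solution outlines from the "propose" step
    selected: int      # 0-based index of the hint chosen in the "select" step
    reasoning: str     # full reasoning from the "think" step


# Hypothetical tag layout; the abstract does not define one.
_PATTERN = re.compile(
    r"<propose>(?P<propose>.*?)</propose>\s*"
    r"<select>(?P<select>\d+)</select>\s*"
    r"<think>(?P<think>.*?)</think>",
    re.DOTALL,
)


def parse_trajectory(text: str) -> Trajectory | None:
    """Return the parsed trajectory, or None if the text is malformed."""
    match = _PATTERN.search(text)
    if match is None:
        return None
    # One hint per line; drop leading "1." or "1)" numbering and empty lines.
    hints = [re.sub(r"^\d+[.)]\s*", "", line.strip())
             for line in match["propose"].splitlines()]
    hints = [h for h in hints if h]
    selected = int(match["select"]) - 1  # the model is assumed to count from 1
    if not 0 <= selected < len(hints):
        return None
    return Trajectory(hints, selected, match["think"].strip())


def hint_diversity(hints: list[str]) -> float:
    """Mean pairwise Jaccard distance over word sets (assumed proxy metric)."""
    sets = [set(h.lower().split()) for h in hints]
    pairs = [(a, b) for i, a in enumerate(sets) for b in sets[i + 1:]]
    if not pairs:
        return 0.0  # fewer than two hints: no diversity to measure
    return sum(1 - len(a & b) / len(a | b) for a, b in pairs) / len(pairs)


sample = (
    "<propose>\n1. Use induction on n\n2. Use a generating function\n</propose>\n"
    "<select>2</select>\n"
    "<think>Expand the series.</think>"
)
trajectory = parse_trajectory(sample)
```

A word-overlap score like this is only a crude stand-in. A training signal would more plausibly compare hints by meaning rather than by shared words, but the abstract gives no detail on that point.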