Reasoning or Memorization? Direction-Aware Diversity Exploration in LLM Reinforcement Learning
Quick Answer
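The paper introduces DiRL, a Direction-Aware Reinforcement Learning framework that steers exploration in large language models toward trajectories that are novel because they follow a new reasoning process, and away from trajectories that merely vary memorized patterns and shortcuts.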
Quick Take
DiRL (Direction-Aware Reinforcement Learning) anchors exploration to an internal reasoning-memorization direction of the policy, so that diversity is rewarded only when it reflects new reasoning. It shapes rewards to amplify reasoning-aligned exploration and suppress memorization-aligned variations, and it plugs into standard Group Relative Policy Optimization (GRPO). The authors report significant improvements over existing exploration methods on mathematical and general reasoning benchmarks.
Key Points
- DiRL anchors exploration to a reasoning-memorization direction extracted from model representations.
- The framework constructs direction-weighted gradient features to characterize rollout updates.
- DiRL shapes rewards to amplify reasoning-aligned exploration while suppressing memorization-aligned variations.
- Experiments on mathematical and general reasoning benchmarks show improvements over various existing exploration methods.
- DiRL integrates seamlessly with standard Group Relative Policy Optimization (GRPO).
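To make the idea of a "reasoning-memorization direction" concrete, here is a minimal sketch. The abstract does not say how DiRL extracts the direction, so this uses a difference of class means, a common stand-in and not necessarily the authors' method. The function names, the toy 2-D vectors, and the idea of scoring a rollout by its signed projection are all assumptions made for illustration.

```python
import math

def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def unit_direction(reasoning_reps, memorization_reps):
    """Unit vector pointing from the memorization mean to the reasoning mean.

    ASSUMPTION: difference of means is a stand-in; the abstract does not
    describe the actual extraction procedure.
    """
    mu_reasoning = mean_vector(reasoning_reps)
    mu_memorization = mean_vector(memorization_reps)
    diff = [a - b for a, b in zip(mu_reasoning, mu_memorization)]
    norm = math.sqrt(sum(x * x for x in diff))
    if norm == 0.0:
        raise ValueError("class means coincide; no direction to extract")
    return [x / norm for x in diff]

def alignment(feature, direction):
    """Signed projection of a rollout feature onto the direction."""
    return sum(f * d for f, d in zip(feature, direction))

# Toy hidden representations (hypothetical values, 2 dimensions).
reasoning_reps = [[1.0, 0.0], [3.0, 0.0]]
memorization_reps = [[-1.0, 0.0], [-3.0, 0.0]]

direction = unit_direction(reasoning_reps, memorization_reps)
reasoning_like = alignment([0.5, 2.0], direction)
memorization_like = alignment([-0.5, 2.0], direction)
```

In this toy setup a positive score means the feature leans toward the reasoning side and a negative score toward the memorization side; the component orthogonal to the direction is ignored.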
Article Content
From source RSS / original summary
arXiv:2606.10346v1 Announce Type: new
Abstract: Reinforcement learning has become a key paradigm for eliciting reasoning abilities in large language models, where exploration is crucial for discovering effective solution trajectories. Existing exploration methods typically encourage diversity in semantic or gradient spaces, without distinguishing what drives this diversity. A trajectory may appear novel because it follows a new reasoning process, or because it varies memorized patterns and shortcuts.
Rewarding both cases equally may steer exploration toward memorization rather than genuine reasoning improvement. In this paper, we propose DiRL, a Direction-Aware Reinforcement Learning framework that anchors exploration to an internal reasoning-memorization direction of the policy.
Specifically, DiRL extracts this direction from model representations, constructs direction-weighted gradient features to characterize rollout updates, and shapes rewards to amplify reasoning-aligned exploration while suppressing memorization-aligned variations. DiRL integrates seamlessly into standard Group Relative Policy Optimization (GRPO).
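The following sketch shows where a shaped reward could enter GRPO. The group-relative advantage, which standardizes rewards within the group of rollouts sampled for one prompt, is the standard GRPO ingredient. The shaping rule is an assumption: the abstract does not give DiRL's formula, so an additive bonus with a hypothetical coefficient `beta` and hypothetical per-rollout alignment scores is used here. The choice of population standard deviation and the `eps` term are also choices made for this sketch.

```python
import statistics

def shaped_rewards(task_rewards, alignments, beta=0.1):
    """Add a direction-alignment bonus to each rollout's task reward.

    ASSUMPTION: additive shaping with coefficient `beta` is illustrative,
    not the formula from the paper.
    """
    if len(task_rewards) != len(alignments):
        raise ValueError("need one alignment score per rollout")
    return [r + beta * a for r, a in zip(task_rewards, alignments)]

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantage: standardize rewards within one prompt's group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std, chosen for this sketch
    return [(r - mean) / (std + eps) for r in rewards]

# Four rollouts for one prompt: two correct (1.0), two incorrect (0.0).
task_rewards = [1.0, 1.0, 0.0, 0.0]
# Hypothetical alignment scores: positive = reasoning-aligned.
alignments = [1.0, -1.0, 1.0, -1.0]

rewards = shaped_rewards(task_rewards, alignments, beta=0.5)
advantages = group_relative_advantages(rewards)
```

With plain GRPO the two correct rollouts would receive the same advantage. Under this illustrative shaping, the correct rollout with positive alignment ends up with a larger advantage than the correct rollout with negative alignment, which is the qualitative effect the abstract describes.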
Extensive experiments on mathematical and general reasoning benchmarks demonstrate the effectiveness of DiRL, showing significant improvements over various existing exploration methods.