Architecture-Aware Reinforcement Learning Makes Sliding-Window Attention Competitive in Math Reasoning
Quick Answer
SWARR (Sliding-Window Attention with Reinforced Adaptation for Math Reasoning) converts a pretrained self-attention (SA) model to sliding-window attention (SWA) with supervised fine-tuning, then adapts it with on-policy reinforcement learning. The paper reports that this recipe substantially narrows the math-reasoning gap between SWA and SA models.
Quick Take
After conversion and supervised fine-tuning (SFT), sliding-window attention (SWA) models still trail self-attention (SA) models. The authors hypothesize that this is partly because most SFT data are prepared for SA models. On-policy reinforcement learning (RL) recovers much of the accuracy lost during conversion while preserving the efficiency benefits of linear-complexity attention.
Key Points
- SWARR consists of supervised fine-tuning and reinforcement learning stages.
- After SFT alone, SWA models still underperform self-attention models.
- On-policy reinforcement learning optimizes self-generated trajectories under the SWA constraint, so the trajectories adapt to better match SWA.
- Experiments on mathematical reasoning benchmarks show that the recipe substantially narrows the accuracy gap between SWA and SA.
- SWARR retains efficiency benefits with linear-complexity attention.
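The efficiency point in the last bullet can be made concrete by counting attended query-key pairs. The sketch below is illustrative only: the function names and the window size are assumptions, not taken from the paper, and it counts pairs for a single causal attention layer rather than measuring real runtime.

```python
def causal_full_pairs(n: int) -> int:
    """Number of (query, key) pairs under full causal self-attention.

    Query i (0-based) attends to keys 0..i, so the total is 1 + 2 + ... + n.
    """
    return n * (n + 1) // 2


def causal_window_pairs(n: int, window: int) -> int:
    """Number of (query, key) pairs under causal sliding-window attention.

    Each query attends to itself and at most (window - 1) preceding tokens.
    """
    if window >= n:
        # The window covers the whole sequence, so nothing is cut off.
        return causal_full_pairs(n)
    # The first `window` queries see 1..window keys; every later query sees exactly `window`.
    return window * (window + 1) // 2 + (n - window) * window


# Toy sizes chosen for illustration (hypothetical, not from the paper).
print(causal_full_pairs(8), causal_window_pairs(8, 3))  # 36 21
```

For a fixed window, each extra token adds at most `window` pairs, whereas under full causal attention each new token adds one pair per token already in the sequence, which is where the quadratic growth comes from.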
Article Content
From source RSS / original summary (arXiv:2606.11634v1)
Abstract: The rapid progress of reasoning and agentic large language models (LLMs) has increased the demand for long-context inference, but self-attention (SA) scales quadratically with context length. To address this, we study SWARR (Sliding-Window Attention with Reinforced Adaptation for Math Reasoning), a practical recipe for adapting SWA models to mathematical reasoning.
SWARR has two stages: (1) efficient conversion from a pretrained SA model to SWA with supervised fine-tuning (SFT), which avoids pretraining a new base model, and (2) policy adaptation with reinforcement learning (RL). We find that SWA still underperforms SA after SFT, and we hypothesize that this gap is caused in part by a data-architecture mismatch: most SFT data are prepared for SA models and may contain long-range dependencies that are difficult for SWA to model.
Because on-policy RL optimizes self-generated trajectories under the SWA constraint, it can adapt trajectories to better match SWA. Experiments on mathematical reasoning benchmarks show that this recipe substantially narrows the gap between SWA and SA, recovering much of the accuracy lost during SWA conversion while preserving the efficiency benefits of linear-complexity attention.
Our central contribution is the empirical finding that RL changes the conclusion one would draw from conversion and SFT alone about SWA's viability for math reasoning.
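To illustrate the SWA constraint the abstract refers to, here is a minimal sketch of a causal sliding-window mask in plain Python. The helper name `sliding_window_mask` and the toy sizes are hypothetical; the paper's actual window size and implementation are not given in this summary.

```python
from typing import List


def sliding_window_mask(n: int, window: int) -> List[List[bool]]:
    """mask[i][j] is True when query position i may attend to key position j.

    Causal sliding window: a query sees itself and the (window - 1) tokens
    immediately before it, and nothing earlier or later.
    """
    return [[0 <= i - j < window for j in range(n)] for i in range(n)]


# Toy sizes for illustration only (assumed, not from the paper).
mask = sliding_window_mask(n=6, window=3)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))

# Position 5 cannot attend directly to position 0: the distance (5) is outside the window (3).
print(mask[5][0])  # False
# Position 3 is within the window of position 5.
print(mask[5][3])  # True
```

Within a single layer, a dependency that spans more than `window` tokens has no direct attention path. This is the kind of long-range dependency the abstract hypothesizes SA-oriented SFT data may contain.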