Depth-Staggered Fibonacci Spacing for Sparse Attention: Static Schedules Beat Learned Dilation and Extrapolate Where Dense Attention Fails
Quick Answer
The study finds that a static per-layer stagger of the Fibonacci spacing in sparse attention models lowers perplexity relative to both learned dilation and a fixed schedule, and that these models extrapolate to longer contexts where dense attention fails.
Quick Take
Notably, the sparse models maintain performance at four times their training length, whereas a recipe-matched dense baseline degrades sharply (perplexity rises by 201% at 4x length). The experiments cover 21 language models trained under one matched recipe, each with 60M parameters and trained on 426M tokens.
Key Points
- Static per-layer staggering improves perplexity over fixed and learned alpha settings.
- Sparse attention models extrapolate effectively to four times their training length.
- Learning alpha per layer does not beat the static schedule and costs roughly five times the inference latency.
- At training length, the best sparse model has about 26% higher perplexity than the dense baseline.
- Staggering gain is uniform across context positions, not just at long range.
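The extrapolation point can be illustrated with a small sketch. The abstract attributes the result to fixed-offset attention only ever querying relative positions seen during training. The code below counts which query-to-key distances first appear at 4x length under each scheme. The lengths (512 and 2048) and the offset list are hypothetical values chosen for illustration; the source does not state them.

```python
def dense_distances(seq_len):
    """Relative distances a causal dense layer can query: 0 .. seq_len - 1."""
    return set(range(seq_len))


def sparse_distances(seq_len, offsets):
    """Relative distances a fixed-offset layer can query at this length."""
    return {o for o in offsets if o < seq_len}


# Hypothetical lengths and offsets (not taken from the paper).
train_len, test_len = 512, 2048
offsets = [0, 1, 2, 3, 4, 5, 8, 13, 21, 34, 55, 89, 144, 233, 377]

# Distances that occur at test time but were never seen in training.
unseen_dense = dense_distances(test_len) - dense_distances(train_len)
unseen_sparse = (sparse_distances(test_len, offsets)
                 - sparse_distances(train_len, offsets))

print(len(unseen_dense))   # distances 512 .. 2047 are new for dense attention
print(len(unseen_sparse))  # the fixed offsets introduce no new distances
```

Under these assumptions the dense layer meets 1536 distances it never trained on, while the fixed-offset layer meets none, which is consistent with the explanation the authors give.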
Abstract: We study sparse self-attention in which each query attends to a dense local window plus a set of Fibonacci-spaced offsets, with a per-layer scalar alpha that compresses or expands the spacing. Across 21 language models trained under one matched recipe (60M parameters, 512 hidden, 16 layers, 426M tokens), we compare four ways of setting alpha across depth: fixed, per-layer learned, a static linear stagger, and a coprime (anti-gridding) reassignment of that stagger, together with a reach-matched power-of-2 control. Three results stand out. First, a static per-layer stagger improves perplexity over both fixed and learned alpha, and the gain is base-agnostic: applying the same stagger to a power-of-2 base lifts it above fixed Fibonacci and to parity with learned Fibonacci attention. Second, learning per layer is inert: it does not beat the static schedule and costs roughly five times the inference latency. Third, and most consequential, all sparse variants extrapolate to four times their training length with little or no degradation, whereas a recipe-matched dense baseline collapses (perplexity rises by 201% at 4x length); we attribute this to fixed-offset attention only ever querying relative positions seen during training. We also report two honest negatives: at training length the best sparse model has about 26% higher perplexity than the dense baseline, and the staggering gain is uniform across context positions rather than concentrated at long range.
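As a rough illustration of the attention pattern the abstract describes, the sketch below builds per-layer offset sets from a dense local window plus alpha-scaled Fibonacci offsets, with alpha staggered linearly across depth. The scaling rule `round(alpha * F_k)`, the function names, the window size, the reach, and the alpha range are all assumptions made for this example; the abstract does not specify them, so this should not be read as the authors' implementation.

```python
def fibonacci_numbers(max_value):
    """Distinct Fibonacci numbers 1, 2, 3, 5, ... not exceeding max_value."""
    fibs, a, b = [], 1, 2
    while a <= max_value:
        fibs.append(a)
        a, b = b, a + b
    return fibs


def layer_offsets(alpha, window, max_reach):
    """Causal relative offsets for one layer.

    ASSUMPTION: alpha rescales each Fibonacci offset as round(alpha * F_k);
    the source only says alpha compresses or expands the spacing.
    """
    local = set(range(window + 1))  # offset 0 is the token itself
    scaled = {round(alpha * f) for f in fibonacci_numbers(max_reach)}
    sparse = {o for o in scaled if window < o <= max_reach}
    return sorted(local | sparse)


def linear_stagger(num_layers, alpha_min, alpha_max):
    """Static per-layer alphas, linearly spaced across depth (hypothetical range)."""
    if num_layers == 1:
        return [alpha_min]
    step = (alpha_max - alpha_min) / (num_layers - 1)
    return [alpha_min + i * step for i in range(num_layers)]


# 16 layers as in the paper's recipe; the other values are illustrative only.
alphas = linear_stagger(num_layers=16, alpha_min=0.5, alpha_max=2.0)
for layer in (0, 15):
    print(layer, layer_offsets(alphas[layer], window=4, max_reach=64))
```

Because each layer gets a different alpha, the layers sample different distances, which is one plausible reading of why a stagger could help; nothing is learned, so the schedule adds no trainable parameters.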
| Comments: | 11 pages, 5 tables |
| Subjects: | Computation and Language (cs.CL); Machine Learning (cs.LG) |
| Cite as: | arXiv:2606.28560 [cs.CL] (or arXiv:2606.28560v1 [cs.CL] for this version) |
| DOI: | https://doi.org/10.48550/arXiv.2606.28560 (arXiv-issued DOI via DataCite) |
Submission history
From: Chad Capps
[v1] Fri, 26 Jun 2026 19:28:48 UTC (15 KB)
Originally published at arxiv.org