Dual Dimensionality for Local and Global Attention
Quick Answer
The study introduces Distance-Adaptive Representation (DAR) for decoder-only Transformers, optimizing attention by using richer representations for local tokens and reduced dimensions for distant ones.
Quick Take
Distance-Adaptive Representation (DAR) closely matches full-dimensional baselines across pretraining scales from 70M to 410M parameters, and in continued supervised fine-tuning of a 1B-scale model, while pointing toward further reductions in KV cache size during inference.
Key Points
- DAR maintains full-dimensional representations for local tokens while reducing dimensions for distant ones.
- Performance closely matches full-dimensional baselines across models with 70M to 410M parameters.
- Reducing dimensionality uniformly across all token positions leads to worse performance.
- The findings challenge the assumption of uniform key and value dimensionality in attention mechanisms.
- This approach enables further reductions in KV cache during inference.
Article Content
From source RSS / original summary
arXiv:2606.18587v1 Announce Type: new
Abstract: Decoder-only Transformers compute attention over the KV cache of preceding tokens. Keys (and Values) are typically represented with the same dimensionality, regardless of a token's distance from the prediction target. In natural language, however, the next word is most strongly influenced by the immediately preceding tokens.
We hypothesize that local and distant tokens impose asymmetric demands on representational capacity: local tokens are more critical for predicting immediate outputs and thus require richer representations, whereas distant tokens primarily serve as long-range memory, for which lower-dimensional representations may suffice.
We formalize this idea as Distance-Adaptive Representation (DAR), implemented in a controlled setting that preserves full-dimensional representations within a local context window while assigning reduced-dimensional representations (e.g., 1/4 of the original dimensionality) to tokens beyond that window. Across multiple pretraining scales (70M to 410M parameters), as well as continued supervised fine-tuning on a 1B-scale model, this approach closely matches the performance of full-dimensional baselines.
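To make the local/distant split concrete, here is a minimal pure-Python sketch of one attention step for the newest token. It is an illustration under stated assumptions, not the paper's implementation: the abstract does not say how the reduced-dimensional representations are produced, so this sketch simply truncates distant keys and values to their leading `reduced_dim` components. The function name `dar_attend`, the parameters `window` and `reduced_dim`, and the per-group score scaling are all hypothetical.

```python
import math

def dar_attend(query, keys, values, window, reduced_dim):
    """Attention output for the newest token over a cache of earlier tokens.

    query: list of d floats for the current position.
    keys, values: lists of d-float lists, oldest token first; the last
        entry is the current token itself.
    window: how many of the most recent tokens keep all d dimensions.
    reduced_dim: leading dimensions kept for tokens beyond the window
        (truncation is an illustrative assumption, not the paper's method).
    """
    n, d = len(keys), len(query)

    def dims_for(j):
        # Tokens within `window` of the newest position use all d dimensions.
        distance = (n - 1) - j
        return d if distance < window else reduced_dim

    # Scaled dot-product scores, computed only over the retained dimensions.
    scores = []
    for j in range(n):
        dims = dims_for(j)
        dot = sum(query[i] * keys[j][i] for i in range(dims))
        scores.append(dot / math.sqrt(dims))

    # Numerically stable softmax over all cached positions.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]

    # Weighted sum of values; distant tokens contribute only their
    # leading `reduced_dim` components.
    out = [0.0] * d
    for j in range(n):
        for i in range(dims_for(j)):
            out[i] += weights[j] * values[j][i]
    return out
```

Setting `window` to at least the cache length, or `reduced_dim` to the full dimension, recovers ordinary full-dimensional attention, which corresponds to the baseline the abstract compares against.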
In contrast, uniformly reducing dimensionality across all token positions leads to worse performance. These results challenge the common assumption that key and value dimensionality should be uniform across token positions. Our findings suggest a new direction for designing attention architectures that adaptively allocate representational capacity across sequences, enabling further reductions in KV cache during inference.
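The KV cache implication can be checked with simple arithmetic. The sketch below counts stored key and value entries for one attention head in one layer, assuming distant tokens really are stored at the reduced dimensionality. The sequence length, head dimension, and window size are hypothetical numbers chosen for illustration; only the 1/4 reduction ratio comes from the abstract.

```python
def kv_cache_entries(seq_len, full_dim, window, reduced_dim):
    """Count scalar entries stored for keys plus values (one head, one layer)."""
    local_tokens = min(seq_len, window)        # kept at full dimensionality
    distant_tokens = seq_len - local_tokens    # kept at reduced dimensionality
    per_tensor = local_tokens * full_dim + distant_tokens * reduced_dim
    return 2 * per_tensor                      # one tensor for keys, one for values

# Hypothetical configuration: 4096 cached tokens, head dimension 64,
# a local window of 256 tokens, and distant tokens at 1/4 dimensionality.
seq_len, full_dim = 4096, 64
baseline = kv_cache_entries(seq_len, full_dim, window=seq_len, reduced_dim=full_dim)
dar = kv_cache_entries(seq_len, full_dim, window=256, reduced_dim=full_dim // 4)
print(dar / baseline)
```

Under these assumed numbers the cache shrinks to roughly 30% of the full-dimensional baseline. The ratio approaches 1/4 as the sequence grows relative to the window, since the full-dimensional local window becomes a smaller share of the cache.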