Do Value Vectors in Deep Layers Need Context from the Residual Stream?
Quick Take
The study introduces Bank of Values (BoV), a method in which the last third of a transformer's layers use learned, context-free value vectors that are looked up per token rather than computed from the residual stream. BoV improves validation loss over standard attention at both 135M and 780M parameters. At 780M it also improves the average score across 21 benchmarks, matching the previous best method while using less compute and memory.
Key Points
- BoV learns context-free, token-specific value vectors for the last third of transformer layers.
- Validation loss improves over standard attention.
- At 780M parameters, the average score across 21 benchmarks matches the previous best method with less compute and memory.
- Evaluated at two model sizes: 135M and 780M parameters.
- Context-free vectors can be stored as sparse model parameters.
Article Content
From source RSS / original summary
arXiv:2606.02780v1 Announce Type: new
Abstract: The success of the transformer architecture as the backbone of modern LLMs is in large part due to its use of attention layers. An attention layer follows the standard neural network paradigm: it takes the residual stream as input and thereby produces context-dependent query, key, and value vectors.
However, we find that model performance meaningfully improves when deeper layers learn only a context-free value vector to preserve the original token information, without drawing on any context from the residual stream. When the model has access to this context-free value vector, adding back the context-dependent component provides little additional benefit for aggregate benchmark performance.
Such context-free value vectors can be stored as sparse model parameters, eliminating the need to recompute or persistently cache these values. Through systematic ablations on the key design choices for such context-free value vectors, we propose Bank of Values (BoV), a new way of computing value vectors in attention by learning a lookup table of token-specific value vectors for each of the last third of layers.
Across 135M and 780M models, BoV improves validation loss over standard attention and, at 780M, the average score across 21 benchmarks, matching the previous best method that adds token information to the value vector with less compute and memory.
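The lookup-table idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function names (`bov_layers`, `make_value_bank`, `bov_attention`), the rounding used for "the last third of layers", the single-head setup, and the random initialization are all assumptions made for illustration. The sketch also assumes that queries and keys are still computed from the residual stream as usual, and that only the value vectors come from the per-layer table indexed by token ID.

```python
import math
import random


def bov_layers(num_layers):
    """Indices of the last third of layers (0-indexed).

    Assumption: "last third" is rounded down via integer division.
    """
    start = num_layers - num_layers // 3
    return list(range(start, num_layers))


def make_value_bank(vocab_size, d_head, seed=0):
    """Hypothetical per-layer table: one value vector per vocabulary token.

    In a real model these rows would be learned; here they are random.
    """
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 0.02) for _ in range(d_head)]
            for _ in range(vocab_size)]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def causal_attention(queries, keys, values):
    """Single-head causal attention over lists of vectors."""
    d = len(queries[0])
    outputs = []
    for i, q in enumerate(queries):
        # Position i attends only to positions 0..i.
        scores = [sum(a * b for a, b in zip(q, keys[j])) / math.sqrt(d)
                  for j in range(i + 1)]
        weights = softmax(scores)
        outputs.append([sum(w * values[j][c] for j, w in enumerate(weights))
                        for c in range(len(values[0]))])
    return outputs


def bov_attention(token_ids, queries, keys, value_bank):
    """Attention whose values are looked up by token ID.

    The residual stream is not used for the values: the same token always
    yields the same value vector, whatever its context.
    """
    values = [value_bank[t] for t in token_ids]
    return causal_attention(queries, keys, values)


# Toy usage: positions 0 and 2 hold the same token and share a value vector.
token_ids = [3, 1, 3]
d_head = 4
bank = make_value_bank(vocab_size=5, d_head=d_head)
rng = random.Random(1)
# Stand-ins for the context-dependent queries and keys.
queries = [[rng.gauss(0.0, 1.0) for _ in range(d_head)] for _ in token_ids]
keys = [[rng.gauss(0.0, 1.0) for _ in range(d_head)] for _ in token_ids]
outputs = bov_attention(token_ids, queries, keys, bank)
print(bov_layers(12))  # [8, 9, 10, 11]
```

Because each position touches only one row of the table, the table behaves like an embedding matrix: the values depend on the token IDs alone, which is consistent with the abstract's point that they need not be recomputed or persistently cached.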