Turn-Averaged SAEs for Feature Discovery and Long-Context Attribution
Quick Answer
Turn-averaged sparse autoencoders (SAEs) represent each Human or Assistant turn with a fixed number of features by learning to reconstruct the model's average activation across the turn, which makes interpretability analysis practical at long context lengths.
Quick Take
Standard SAEs operate on individual token activations, so the number of active features grows linearly with context length and long transcripts become hard to study. Turn-averaged SAEs instead reconstruct the average activation across a turn. The authors report that, when judged by an LLM, turn-averaged features describe a turn's high-level characteristics more completely than per-token features, and that the approach greatly simplifies downstream uses of SAEs such as attribution graphs.
Key Points
- Turn-averaged SAEs represent each dialogue turn with a fixed number of features, regardless of turn length.
- They make interpretability techniques practical at long context lengths.
- When judged by an LLM, turn-averaged features describe a turn's high-level characteristics more completely than per-token features.
- The approach greatly simplifies common downstream uses of SAEs, such as attribution graphs.
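The scaling argument behind these points can be made concrete with a little arithmetic. The sketch below assumes, purely for illustration, a top-k style SAE in which every input activates exactly `k` features. The function name `active_feature_counts` and the numbers used are hypothetical and are not taken from the paper.

```python
def active_feature_counts(num_tokens: int, k: int) -> dict:
    """Count the active features describing one turn under each scheme.

    Assumption (not from the paper): every SAE input activates exactly k features.
    """
    if num_tokens < 1 or k < 1:
        raise ValueError("num_tokens and k must be positive")
    return {
        # A per-token SAE encodes each token separately, so the total grows with turn length.
        "per_token": num_tokens * k,
        # A turn-averaged SAE encodes one averaged activation, so the total stays at k.
        "turn_averaged": k,
    }


for turn_length in (10, 100, 1000):
    print(turn_length, active_feature_counts(turn_length, k=32))
```

Under this assumption the per-token count grows in proportion to the turn length, while the turn-averaged count does not change.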
Paper Resources
Abstract: Sparse autoencoders (SAEs) have become a useful tool for extracting interpretable features in language models. However, standard SAE architectures operate on individual token activations, meaning that the number of active features scales linearly with context length, and studying long model transcripts becomes difficult. We introduce turn-averaged SAEs, which represent a single Human or Assistant turn with a fixed number of features by learning to reconstruct the average model activation across the turn. We find that turn-averaged features describe a single turn's high-level characteristics more completely than per-token features when judged by an LLM. We also demonstrate that turn-averaged SAEs greatly simplify common downstream uses of SAEs like attribution graphs. Broadly, turn-averaged SAEs make interpretability techniques practical at long context lengths.
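The core idea in the abstract, reconstructing the average activation across a turn, can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not specify the architecture, the sparsity mechanism, or the training procedure. The top-k ReLU encoder, the mean-squared-error loss, the random untrained weights, and every name and size below (`turn_mean`, `encode_top_k`, `d_model`, `n_features`, `k`) are hypothetical.

```python
import random
from typing import List

Vector = List[float]
Matrix = List[Vector]


def matvec(matrix: Matrix, vector: Vector) -> Vector:
    """Multiply a (rows x cols) matrix by a length-cols vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def turn_mean(token_activations: List[Vector]) -> Vector:
    """Average the per-token activation vectors of one turn."""
    if not token_activations:
        raise ValueError("a turn needs at least one token")
    n = len(token_activations)
    return [sum(column) / n for column in zip(*token_activations)]


def encode_top_k(x: Vector, w_enc: Matrix, b_enc: Vector, k: int) -> Vector:
    """ReLU encoder that keeps only the k largest activations (assumed sparsity rule)."""
    pre = [max(0.0, p + b) for p, b in zip(matvec(w_enc, x), b_enc)]
    keep = set(sorted(range(len(pre)), key=lambda i: pre[i], reverse=True)[:k])
    return [p if i in keep else 0.0 for i, p in enumerate(pre)]


def decode(features: Vector, w_dec: Matrix, b_dec: Vector) -> Vector:
    """Linear decoder mapping features back to activation space."""
    return [r + b for r, b in zip(matvec(w_dec, features), b_dec)]


def mse(target: Vector, prediction: Vector) -> float:
    """Mean squared error between two equal-length vectors."""
    return sum((t - p) ** 2 for t, p in zip(target, prediction)) / len(target)


# Hypothetical sizes and random, untrained weights, for illustration only.
rng = random.Random(0)
d_model, n_features, k, num_tokens = 8, 32, 4, 50
w_enc = [[rng.gauss(0, 0.3) for _ in range(d_model)] for _ in range(n_features)]
w_dec = [[rng.gauss(0, 0.3) for _ in range(n_features)] for _ in range(d_model)]
b_enc, b_dec = [0.0] * n_features, [0.0] * d_model

# Stand-in for the model's per-token activations over one turn.
turn = [[rng.gauss(0, 1) for _ in range(d_model)] for _ in range(num_tokens)]

target = turn_mean(turn)  # one vector per turn, whatever the turn's length
features = encode_top_k(target, w_enc, b_enc, k)  # at most k nonzero features
loss = mse(target, decode(features, w_dec, b_dec))  # what training would minimize
print(sum(f > 0 for f in features), "active features; loss =", loss)
```

Because the encoder sees a single averaged vector, the number of active features for the turn is bounded by `k` whether the turn has 50 tokens or 5,000. A per-token SAE would instead run the encoder once per token.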
| Subjects: | Computation and Language (cs.CL); Machine Learning (cs.LG) |
| Cite as: | arXiv:2606.28548 [cs.CL] (or arXiv:2606.28548v1 [cs.CL] for this version) |
| DOI: | https://doi.org/10.48550/arXiv.2606.28548 (arXiv-issued DOI via DataCite) |
Submission history
From: Kevin Der
[v1] Fri, 26 Jun 2026 19:07:34 UTC (1,684 KB)
Originally published at arxiv.org