Turn-Averaged SAEs for Feature Discovery and Long-Context Attribution

arXiv cs.CL·Kevin Der, Harish Kamath, Ben Thompson

6/30/2026

Quick Answer

This paper introduces turn-averaged sparse autoencoders (SAEs), which learn to reconstruct the average model activation across a single Human or Assistant turn. Each turn is represented by a fixed number of features, so the feature count no longer grows with context length and long transcripts become easier to interpret.

Quick Take

Standard SAEs operate on individual token activations, so the number of active features scales linearly with context length. Turn-averaged SAEs instead reconstruct the mean activation over a whole turn. The authors report that, when judged by an LLM, turn-averaged features describe a turn's high-level characteristics more completely than per-token features, and that the approach greatly simplifies common downstream uses of SAEs such as attribution graphs.

Key Points

Turn-averaged SAEs represent each Human or Assistant turn with a fixed number of features.
They make interpretability techniques practical at long context lengths.
Turn-averaged features describe a turn's high-level characteristics more completely than per-token features, as judged by an LLM.
The approach simplifies common downstream uses of SAEs, such as attribution graphs.
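
The averaging step behind these points can be sketched as mean-pooling per-token activations within each turn. This is a minimal illustration, not the authors' code: the function name `turn_average`, the `turn_ids` encoding (one integer turn index per token), and the toy values are all assumptions made for the example.

```python
import numpy as np

def turn_average(token_acts, turn_ids):
    """Mean-pool per-token activations into one vector per turn.

    token_acts: array of shape (n_tokens, d_model)
    turn_ids:   integer array of shape (n_tokens,), values in 0..n_turns-1
                (hypothetical encoding chosen for this sketch)
    Returns an array of shape (n_turns, d_model).
    """
    token_acts = np.asarray(token_acts, dtype=np.float64)
    turn_ids = np.asarray(turn_ids, dtype=np.int64)
    if token_acts.ndim != 2 or turn_ids.shape != (token_acts.shape[0],):
        raise ValueError("expected (n_tokens, d_model) activations and (n_tokens,) turn ids")

    n_turns = int(turn_ids.max()) + 1
    counts = np.bincount(turn_ids, minlength=n_turns)
    if np.any(counts == 0):
        raise ValueError("every turn id must cover at least one token")

    # Accumulate each token's activation into the row for its turn.
    sums = np.zeros((n_turns, token_acts.shape[1]))
    np.add.at(sums, turn_ids, token_acts)
    return sums / counts[:, None]

# Toy example: three tokens, the first two in turn 0 and the last in turn 1.
acts = np.array([[1.0, 2.0], [3.0, 4.0], [10.0, 20.0]])
ids = np.array([0, 0, 1])
turn_means = turn_average(acts, ids)
```

However many tokens a turn contains, it contributes exactly one row to `turn_means`, which is what keeps the downstream feature count independent of turn length.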

[Submitted on 26 Jun 2026]

Abstract: Sparse autoencoders (SAEs) have become a useful tool for extracting interpretable features in language models. However, standard SAE architectures operate on individual token activations, meaning that the number of active features scales linearly with context length, and studying long model transcripts becomes difficult. We introduce turn-averaged SAEs, which represent a single Human or Assistant turn with a fixed number of features by learning to reconstruct the average model activation across the turn. We find that turn-averaged features describe a single turn's high-level characteristics more completely than per-token features when judged by an LLM. We also demonstrate that turn-averaged SAEs greatly simplify common downstream uses of SAEs like attribution graphs. Broadly, turn-averaged SAEs make interpretability techniques practical at long context lengths.
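
To make the scaling argument concrete, here is a rough sketch of an SAE forward pass applied to a turn-averaged activation. The abstract does not specify the architecture, so everything below is an assumption for illustration: a top-k sparsity rule (one common way to fix the number of active features), randomly initialized weights, and the hypothetical names `topk_sae`, `W_enc`, `W_dec`, `b_enc`, and `b_dec`.

```python
import numpy as np

def topk_sae(x, W_enc, b_enc, W_dec, b_dec, k):
    """Encode x, keep the k largest ReLU activations, and decode.

    Assumed top-k SAE; the paper's actual sparsity mechanism may differ.
    x: (d_model,), W_enc: (d_model, n_features), W_dec: (n_features, d_model)
    """
    n_features = W_enc.shape[1]
    if not 0 < k <= n_features:
        raise ValueError("k must be in 1..n_features")

    pre = np.maximum((x - b_dec) @ W_enc + b_enc, 0.0)
    # Indices of the k largest activations (order within the top k is arbitrary).
    keep = np.argpartition(pre, n_features - k)[n_features - k:]
    features = np.zeros_like(pre)
    features[keep] = pre[keep]

    recon = features @ W_dec + b_dec
    return features, recon

rng = np.random.default_rng(0)
d_model, n_features, k, n_tokens = 8, 32, 4, 50

# Untrained, random parameters: this only demonstrates shapes and sparsity.
W_enc = rng.normal(size=(d_model, n_features))
W_dec = rng.normal(size=(n_features, d_model))
b_enc = np.zeros(n_features)
b_dec = np.zeros(d_model)

token_acts = rng.normal(size=(n_tokens, d_model))  # stand-in for one turn's activations
turn_vec = token_acts.mean(axis=0)                 # reconstruction target: the turn average

features, recon = topk_sae(turn_vec, W_enc, b_enc, W_dec, b_dec, k)
mse = float(np.mean((recon - turn_vec) ** 2))      # training would minimize this

per_token_budget = n_tokens * k  # upper bound if the SAE ran on every token separately
turn_budget = k                  # upper bound for the single turn-averaged vector
```

Under these assumptions a 50-token turn could activate up to `per_token_budget` features with a per-token SAE, while the turn-averaged version is capped at `turn_budget` regardless of how long the turn is.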

Subjects:	Computation and Language (cs.CL); Machine Learning (cs.LG)
Cite as:	arXiv:2606.28548 [cs.CL]
	(or arXiv:2606.28548v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2606.28548 arXiv-issued DOI via DataCite

Submission history

From: Kevin Der
[v1] Fri, 26 Jun 2026 19:07:34 UTC (1,684 KB)

— Originally published at arxiv.org


