Sparse Autoencoders Map Brain-LLM Alignment onto Cortical Semantic Topography

arXiv cs.CL·Dongxin Guo, Jikun Wu, Siu Ming Yiu

5/25/2026


Quick Answer

This paper uses sparse autoencoders (SAEs) to decompose GPT-2 XL and Llama-3.1-8B into interpretable features and relate them to human brain responses. Semantic features alone recover 94% of peak encoding performance.

Quick Take

By bridging sparse autoencoders (SAEs) with neural encoding models, the study finds that semantic features of GPT-2 XL and Llama-3.1-8B alone recover 94% of peak brain-encoding performance. Five semantic subcategories, derived a priori from three independent neuroscience programs, map onto distinct brain regions (Spearman ρ = 0.72, p < 0.001), and the results generalize across English, Chinese, and French.

Key Points

SAEs decompose GPT-2 XL and Llama-3.1-8B into 16K-32K interpretable features per layer.
Semantic features alone recover 94% of peak encoding performance.
Five semantic subcategories, derived a priori, map onto distinct brain regions.
SAE features predict human reading times beyond lexical controls.
Results generalize across English, Chinese, and French.
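To make the first key point concrete, here is a minimal sketch of how a sparse autoencoder turns one layer's activations into sparse, non-negative features. It assumes a standard ReLU SAE; the paper's exact architecture and training objective are not given in this summary, and the function names, toy sizes, and random weights below are all hypothetical.

```python
import numpy as np

def sae_encode(x, W_enc, b_enc, b_dec):
    """Map activations x (n, d_model) to non-negative codes (n, n_features)."""
    # ReLU zeroes out every feature whose pre-activation is negative.
    return np.maximum(0.0, (x - b_dec) @ W_enc + b_enc)

def sae_decode(f, W_dec, b_dec):
    """Reconstruct activations (n, d_model) from feature codes f."""
    return f @ W_dec + b_dec

rng = np.random.default_rng(0)
d_model, n_features = 16, 64  # toy sizes; the paper reports 16K-32K features per layer

# Random weights stand in for a trained SAE (assumption for illustration only).
W_enc = rng.normal(scale=0.1, size=(d_model, n_features))
W_dec = rng.normal(scale=0.1, size=(n_features, d_model))
b_enc = -0.2 * np.ones(n_features)  # negative bias pushes many pre-activations below zero
b_dec = np.zeros(d_model)

x = rng.normal(size=(5, d_model))   # stand-in for LLM layer activations
features = sae_encode(x, W_enc, b_enc, b_dec)
recon = sae_decode(features, W_dec, b_dec)
sparsity = float((features > 0).mean())  # fraction of active features
```

In a trained SAE, each column of `features` would correspond to one candidate interpretable feature, and those columns are what get labeled (for example as semantic) and passed to a brain encoding model.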

Paper Resources

Read Paper: arxiv.org · View PDF: arxiv.org

Article Content

From source RSS / original summary

arXiv:2605.23035v1

Abstract: Intermediate layers of large language models (LLMs) best predict human brain responses to language, one of the most robust findings in computational neurolinguistics, yet why remains mechanistically unexplained. We address this gap by bridging sparse autoencoders (SAEs) from mechanistic interpretability with neural encoding models, decomposing GPT-2 XL and Llama-3.1-8B into 16K-32K interpretable features per layer. A human-validated taxonomy ($\kappa \geq 0.74$) reveals that semantic features alone recover 94% of peak encoding performance ($r=0.285$), substantially exceeding variance-matched baselines ($p<0.001$, $d=1.31$). Beyond this aggregate dominance, we test a novel cortical topography prediction: five semantic subcategories derived a priori from three independent neuroscience programs should map onto distinct brain regions. A formal convergence test confirms this alignment (Spearman $\rho=0.72$, $p<0.001$; hypergeometric $p=0.007$), demonstrating that SAE-discovered features recapitulate known cortical semantic organization at a granularity inaccessible to prior methods. SAE features further predict human reading times beyond lexical controls ($\Delta\mathrm{logLik}=38.4$, $p<0.001$), and an exploratory prediction-error analysis provides preliminary evidence that the brain additionally encodes unexpected semantic content. Results generalize across English, Chinese, and French.
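The "94% of peak encoding performance" figure can be read as a ratio of two encoding scores. The sketch below illustrates that idea under stated assumptions: a closed-form ridge regression stands in for the encoding model, held-out Pearson correlation is the score, and the full feature set stands in for "peak" performance. The abstract specifies none of these, and the data, the feature split, and all function names are synthetic or hypothetical.

```python
import numpy as np

def fit_ridge(X, Y, alpha):
    """Closed-form ridge regression: W = (X^T X + alpha I)^-1 X^T Y."""
    n_feat = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_feat), X.T @ Y)

def pearson_per_target(Y_true, Y_pred):
    """Pearson r between true and predicted responses, one value per column."""
    yt = Y_true - Y_true.mean(axis=0)
    yp = Y_pred - Y_pred.mean(axis=0)
    denom = np.sqrt((yt ** 2).sum(axis=0) * (yp ** 2).sum(axis=0))
    return (yt * yp).sum(axis=0) / denom

def encoding_score(X_train, Y_train, X_test, Y_test, alpha=1.0):
    """Fit on the training split, return mean held-out correlation across targets."""
    W = fit_ridge(X_train, Y_train, alpha)
    return float(pearson_per_target(Y_test, X_test @ W).mean())

# Synthetic stand-in data: rows are stimuli, columns are features or voxels.
rng = np.random.default_rng(0)
n_train, n_test, n_feat, n_vox = 400, 100, 20, 8
X = rng.normal(size=(n_train + n_test, n_feat))
W_true = rng.normal(size=(n_feat, n_vox))
Y = X @ W_true + rng.normal(scale=2.0, size=(n_train + n_test, n_vox))
X_tr, X_te = X[:n_train], X[n_train:]
Y_tr, Y_te = Y[:n_train], Y[n_train:]

# Hypothetical split: pretend the first 10 features were labeled "semantic".
semantic_cols = np.arange(10)
r_all = encoding_score(X_tr, Y_tr, X_te, Y_te)
r_semantic = encoding_score(X_tr[:, semantic_cols], Y_tr, X_te[:, semantic_cols], Y_te)
fraction_recovered = r_semantic / r_all  # analogous in spirit to the reported 94%
```

On real data, `X` would hold SAE feature activations aligned to the stimuli and `Y` the recorded brain responses. The synthetic numbers produced here carry no meaning beyond showing how such a ratio is formed.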

Read on arxiv.org
