What are They Thinking? Delineation, Probing and Tracking of Concepts in LLMs
Quick Take
The paper introduces a method for probing LLMs to detect concepts within their embeddings, enabling monitoring of model 'thoughts.' It demonstrates the creation of linear probes for four concepts across three LLMs, paving the way for scalable concept tracking in future models.
Key Points
- Probes detect the presence of concepts in LLM embeddings.
- Method includes delineating concepts with presence and absence datasets.
- Linear probes are trained to detect concepts across LLM layers.
- Four concepts were tested on three different LLMs.
- Scaling the process to many more concepts is expected to make it easy to monitor new models.
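The steps in the list above can be illustrated with a minimal sketch. It is not the paper's implementation: the "embeddings" are synthetic Gaussian vectors rather than hidden states from an LLM, the probe is a plain logistic regression trained by stochastic gradient descent, and every name (make_synthetic_embeddings, train_linear_probe, probe_all_layers, the layer_signals mapping) is hypothetical. It uses only the Python standard library so it runs without extra dependencies.

```python
import math
import random


def make_synthetic_embeddings(n, dim, signal, rng):
    """Hypothetical stand-in for one layer's embeddings.

    Label 1 means "concept present", label 0 means "concept absent".
    The concept is encoded (by assumption) as a shift along coordinate 0.
    """
    data = []
    for i in range(n):
        label = i % 2
        vec = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        vec[0] += signal if label else -signal
        data.append((vec, label))
    rng.shuffle(data)
    return data


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_linear_probe(data, dim, epochs=50, lr=0.1):
    """Fit a linear probe (logistic regression) with plain SGD."""
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        for vec, label in data:
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, vec)) + b)
            err = p - label  # gradient of the log loss w.r.t. the logit
            for j in range(dim):
                w[j] -= lr * err * vec[j]
            b -= lr * err
    return w, b


def accuracy(w, b, data):
    """Fraction of examples where the probe's sign matches the label."""
    correct = 0
    for vec, label in data:
        pred = 1 if sum(wi * xi for wi, xi in zip(w, vec)) + b > 0 else 0
        correct += pred == label
    return correct / len(data)


def probe_all_layers(layer_signals, dim=8, n_train=200, n_test=100, seed=0):
    """Train and test one probe per layer.

    layer_signals maps a layer index to an assumed signal strength; with a
    real model, each layer's data would come from that layer's hidden states.
    """
    rng = random.Random(seed)
    results = {}
    for layer, signal in layer_signals.items():
        train = make_synthetic_embeddings(n_train, dim, signal, rng)
        test = make_synthetic_embeddings(n_test, dim, signal, rng)
        w, b = train_linear_probe(train, dim)
        results[layer] = accuracy(w, b, test)
    return results


# Example: layer 0 carries no signal, layer 1 carries a strong one (both assumed).
print(probe_all_layers({0: 0.0, 1: 3.0}))
```

Comparing held-out accuracy per layer is one common way to see where a concept becomes linearly readable. The numbers this sketch prints reflect only the synthetic data and say nothing about the paper's results.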
Article Excerpt
From source RSS / original summary
arXiv:2605.28823v1 Announce Type: new
Abstract: As the influence of LLMs expands, it is imperative to gain insight into their decisions. One way to do that is to develop probes that detect the presence or absence of a broad set of concepts within the embeddings computed in an LLM - which is what we might say a model is "thinking" about. Such probes should be low-cost and easily applicable to any LLM, so that monitoring for many concepts is possible during normal operation.
In this paper, we take the first steps towards developing the capability of creating many such probes by defining and executing examples of the key tasks needed: first, the careful delineation of a concept through the creation of a dataset with the concept both present and then absent. Then, the training and testing of a set of linear probes to detect the concept on any layer of an LLM, including an exploration of the complexity of the probe needed.
Finally, we show that such probes can track concepts across larger contexts. This is done with four separate concepts and three different LLMs. When this process is scaled to many more concepts, it will create the ability to easily monitor new models.
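The abstract does not say how concepts are tracked across a context. One plausible reading, sketched here purely as an assumption, is to apply an already trained linear probe at each token position and smooth the resulting scores. The names probe_score and track_concept, the window parameter, and the toy weights and embeddings are all hypothetical.

```python
import math


def probe_score(w, b, vec):
    """Probability that the concept is present, according to a linear probe."""
    z = sum(wi * xi for wi, xi in zip(w, vec)) + b
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def track_concept(w, b, token_embeddings, window=3):
    """Score every token position, then apply a trailing moving average.

    Returns (raw_scores, smoothed_scores), both with one entry per token.
    """
    scores = [probe_score(w, b, v) for v in token_embeddings]
    smoothed = []
    for i in range(len(scores)):
        start = max(0, i - window + 1)
        chunk = scores[start:i + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return scores, smoothed


# Toy example: a 2-dimensional probe and four made-up token embeddings.
w, b = [2.0, 0.0], 0.0
tokens = [[-1.0, 0.3], [-1.0, -0.2], [1.0, 0.1], [1.0, 0.0]]
raw, smooth = track_concept(w, b, tokens, window=2)
print(raw)
print(smooth)
```

In this toy input the raw score rises from the second to the third token, and the smoothed series follows more gradually. A larger window trades responsiveness for stability.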