Localizing Anchoring Pathways in Language Models
Quick Answer
Irrelevant numbers in a prompt can shift language model judgments in numerical reasoning. This study localizes where that anchor-sensitive signal is carried inside 7B-8B Qwen and Llama base and instruction-tuned models.
Quick Take
Using a logit-difference metric and attribution-based circuit localization, the study finds that edge-level methods recover the anchoring signal more faithfully than node-level methods. Low- and high-anchor circuits transfer strongly within a model, which suggests shared pathways across anchor direction, while sparse transfer between base and instruction-tuned variants is less reliable.
Key Points
- Irrelevant numbers in prompts create anchoring effects in language model judgments.
- A logit-difference metric, comparing the correct answer option with the option matching the anchor, is validated as tracking behavioral anchoring.
- Edge-level methods recover the anchoring signal more faithfully than node-level methods.
- Low- and high-anchor circuits transfer strongly within a model, suggesting shared pathway structure across anchor direction.
- Sparse transfer across base and instruction-tuned variants is less reliable, indicating that post-training changes which pathways matter most.
Article Excerpt
From source RSS / original summary
arXiv:2606.12818v1. Announce Type: new.
Abstract: Irrelevant numbers in a prompt can shift language model judgments, producing anchoring effects in numerical reasoning. We study where this anchor-sensitive signal is carried inside language models using a controlled multiple-choice setup with shared answer options. We define a logit-difference metric comparing the correct answer option with the answer option corresponding to the anchor, and validate that it tracks behavioral anchoring.
Using attribution-based circuit localization on 7B-8B Qwen and Llama base and instruction-tuned models, we find that edge-level methods recover this signal more faithfully than node-level methods. Low- and high-anchor circuits transfer strongly within a model, suggesting shared pathway structure across anchor direction. However, sparse transfer across base and instruction-tuned variants is less reliable, indicating that post-training changes which pathways matter most.
Overall, our results provide a mechanistic account of how anchoring-related decision signals are carried inside language models.
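The logit-difference metric described in the abstract can be illustrated with a minimal sketch. The abstract only says the metric compares the correct answer option with the option corresponding to the anchor; the sign convention (correct minus anchor), the function names, and the example logits below are assumptions for illustration, not the authors' implementation.

```python
from typing import Mapping, Sequence, Tuple


def anchor_logit_diff(
    option_logits: Mapping[str, float], correct: str, anchor: str
) -> float:
    """Logit of the correct option minus logit of the anchor-matching option.

    Assumed convention: a negative value means the model favors the
    anchor-matching option over the correct one.
    """
    if correct == anchor:
        raise ValueError("correct and anchor options must differ")
    return option_logits[correct] - option_logits[anchor]


def mean_anchor_logit_diff(
    items: Sequence[Tuple[Mapping[str, float], str, str]]
) -> float:
    """Average the per-item logit difference over a set of prompts."""
    if not items:
        raise ValueError("items must not be empty")
    diffs = [anchor_logit_diff(logits, c, a) for logits, c, a in items]
    return sum(diffs) / len(diffs)


# Hypothetical logits over shared answer options for two prompts.
item_1 = ({"A": 2.0, "B": 0.5, "C": -1.0, "D": 3.5}, "A", "D")
item_2 = ({"A": 1.0, "B": 3.0, "C": 0.0, "D": -2.0}, "B", "A")

print(anchor_logit_diff(*item_1))
print(mean_anchor_logit_diff([item_1, item_2]))
```

Under this convention, the first item yields a negative difference (the anchor option outscores the correct one), while the second yields a positive one. Averaging over many prompts gives a single scalar that an attribution method could target.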