Pruning via Causal Attribution Preserves Reasoning Performance in Large Language Models
Quick Answer
This paper introduces Causal Attribution Pruning (CAP), a training-free pruning method that better preserves reasoning performance in large language models such as Llama-3-8B-Instruct and Mistral-7B-Instruct, with relative accuracy gains of up to 61% over Wanda on ARC-Challenge at 20% sparsity.
Quick Take
CAP identifies critical attention heads by their causal impact on reasoning tasks and uses those head-level scores to guide weight pruning. At moderate sparsity (10-20%) it preserves reasoning accuracy better than Wanda in most tested model-benchmark configurations.
Key Points
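- CAP scores each attention head by the expected performance degradation when that head is masked on a small calibration set of reasoning problems.
- Relative accuracy gains reach up to 61% over Wanda on ARC-Challenge at 20% sparsity.
- Evaluated on GSM8K, StrategyQA, and ARC-Challenge with Llama-3-8B-Instruct and Mistral-7B-Instruct at 10%, 20%, and 50% sparsity.
- CAP's interventional scores are contrasted with magnitude-only and activation-based criteria; the reported comparison is against Wanda.
- Improvements are most notable at moderate sparsity (10-20%); at 50% sparsity the method is limited by coarse MLP attribution.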
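Because the headline figure is a relative gain, it helps to separate it from an absolute gain in percentage points. The snippet below is a minimal sketch that assumes the usual definition, (method - baseline) / baseline. The summary does not spell out the formula, and the accuracies used here are hypothetical placeholders, not results from the paper.

```python
def relative_gain(acc_method, acc_baseline):
    """Relative improvement of a method over a baseline, as a fraction of the baseline."""
    if acc_baseline <= 0:
        raise ValueError("baseline accuracy must be positive")
    return (acc_method - acc_baseline) / acc_baseline


# Hypothetical accuracies for illustration only, NOT taken from the paper.
baseline_acc = 0.40
method_acc = 0.50

rel = relative_gain(method_acc, baseline_acc)    # fraction of the baseline accuracy
abs_points = (method_acc - baseline_acc) * 100   # absolute gain in percentage points

print(f"relative gain: {rel:.0%}, absolute gain: {abs_points:.0f} points")
```

With these placeholder numbers, a 10-point absolute improvement corresponds to a 25% relative gain, so a large relative figure can coexist with a modest absolute one when the baseline accuracy is low.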
Article Content
From source RSS / original summary
arXiv:2606.19350v1 Announce Type: new
Abstract: Large language models (LLMs) excel at multi-step reasoning but incur substantial inference cost. We introduce Causal Attribution Pruning (CAP), a training-free method that identifies critical attention heads by measuring their causal impact on reasoning tasks and uses these head-level scores to guide fine-grained weight pruning.
For each attention head, CAP estimates the expected performance degradation when the head is masked during forward passes on a small calibration set of reasoning problems. These causal scores are then converted into weight-level importance values for the corresponding projection matrices. Unlike magnitude-only or activation-based criteria, CAP's interventional measurement directly captures each head's functional contribution, yielding relative accuracy gains of up to 61% over Wanda on ARC-Challenge at 20% sparsity.
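The head-masking step described above can be sketched roughly as follows. This is an illustrative reading of the abstract, not the authors' implementation: `evaluate` is a hypothetical callback that runs the model on the calibration set with a given 0/1 head mask and returns a scalar metric, and the toy evaluator stands in for a real model.

```python
import numpy as np


def head_causal_scores(evaluate, num_heads):
    """Score each head by the drop in a calibration metric when that head is masked.

    `evaluate(mask)` is a hypothetical callback: it takes a 0/1 vector of
    length `num_heads` (1 = head active) and returns a scalar such as accuracy.
    """
    full_mask = np.ones(num_heads)
    baseline = evaluate(full_mask)
    scores = np.zeros(num_heads)
    for h in range(num_heads):
        mask = full_mask.copy()
        mask[h] = 0.0                          # intervene: switch off one head
        scores[h] = baseline - evaluate(mask)  # degradation attributed to head h
    return scores


# Toy stand-in for a model (an assumption for illustration): the metric is a
# constant plus a fixed contribution from every active head.
contrib = np.array([0.30, 0.05, 0.00, 0.15])


def toy_evaluate(mask):
    return 0.25 + float(contrib @ mask)


scores = head_causal_scores(toy_evaluate, num_heads=4)
print(scores)
```

In this additive toy setting the scores recover each head's contribution. A real model has interactions between heads, which is why the abstract speaks of an expected degradation estimated over a calibration set rather than an exact decomposition.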
We evaluate CAP on GSM8K, StrategyQA, and ARC-Challenge using Llama-3-8B-Instruct and Mistral-7B-Instruct at 10%, 20%, and 50% sparsity. At moderate sparsity (10-20%), CAP improves over Wanda in most model-benchmark configurations, with especially large gains on ARC-Challenge for Llama-3.
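To make the sparsity levels concrete, here is one way head-level scores could be turned into weight-level importance and thresholded at a target sparsity. The abstract does not specify the conversion rule, so several choices here are assumptions for illustration only: scaling each weight's magnitude by its head's score, laying out the projection matrix with rows grouped by head, and the function names themselves.

```python
import numpy as np


def head_scores_to_weight_importance(W, head_scores, head_dim):
    """Assumed rule: importance = |W| scaled by the score of the head owning each row.

    Assumes the rows of W are grouped by head, `head_dim` rows per head.
    """
    head_scores = np.asarray(head_scores, dtype=float)
    assert W.shape[0] == head_scores.size * head_dim
    per_row = np.repeat(head_scores, head_dim)   # one score per output row
    return np.abs(W) * per_row[:, None]


def prune_to_sparsity(W, importance, sparsity):
    """Zero out the `sparsity` fraction of weights with the lowest importance."""
    k = int(round(sparsity * W.size))
    if k == 0:
        return W.copy()
    order = np.argsort(importance, axis=None, kind="stable")  # flat indices, ascending
    keep = np.ones(W.size, dtype=bool)
    keep[order[:k]] = False
    return W * keep.reshape(W.shape)


rng = np.random.default_rng(0)
W = rng.normal(size=(8, 6))                      # 2 heads x head_dim 4, input dim 6
importance = head_scores_to_weight_importance(W, [0.30, 0.05], head_dim=4)

# The three sparsity levels named in the evaluation.
pruned = {s: prune_to_sparsity(W, importance, s) for s in (0.10, 0.20, 0.50)}
for s, P in pruned.items():
    print(s, np.count_nonzero(P == 0), "of", W.size, "weights zeroed")
```

Under this assumed rule, weights belonging to low-scoring heads are removed first, which is one plausible way for head-level attribution to guide fine-grained pruning.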
Our results suggest that attention-head-level causal attribution can better preserve reasoning performance on downstream benchmarks than correlational pruning criteria at equivalent sparsity, while remaining limited by coarse MLP attribution at 50% sparsity.