The Benchmark Illusion: Pruned LLMs Can Pass Multiple Choice but Fail to Answer
Quick Answer
This paper shows that pruned large language models (LLMs) can score well on multiple-choice tests yet fail to answer the same questions in open-ended generation, a gap the authors call a 'benchmark illusion.' Under high-sparsity pruning, especially with the Wanda pruning method, models make recognition-only errors in which the correct answer is demoted rather than erased.
Quick Take
Pruned LLMs can still pick the right option under multiple-choice scoring while failing to produce that answer in greedy open generation. The effect is strongest under high-sparsity pruning, especially with Wanda, which is a pruning method rather than a model. In these recognition-only errors the correct answer is usually demoted rather than erased: it often reappears with beam search, sampling, or a single in-context example. Compressed models may therefore be less reliable than multiple-choice benchmarks indicate, and they should also be evaluated on what they can generate.
Key Points
- Pruned models perform well in multiple-choice but struggle in open generation tasks.
- High-sparsity pruning, especially with the Wanda pruning method, leads to recognition-only errors.
- Correct answers may be demoted rather than erased in pruned models.
- Multiple-choice benchmarks may overstate the effectiveness of compressed LLMs.
- Evaluations should focus on generative capabilities, not just recognition.
Article Excerpt
From source RSS / original summary
arXiv:2606.17609v1
Abstract: Compressing large language models reduces memory use and inference cost, but it can also create failures that standard benchmarks miss. A pruned model may still perform well on multiple-choice evaluations, yet fail to answer the same question in open generation. We ask what pruning changes: does it erase the correct answer, or does it make the answer harder to produce as the top output?
We study this question with multilingual question answering, tracking the same questions before and after pruning. We find a benchmark illusion. Under high-sparsity pruning, especially Wanda, models often fail in greedy open generation while still selecting the correct answer under multiple-choice scoring. In these recognition-only errors, the answer is usually not gone, but demoted: it often reappears with beam search, sampling, or one in-context example.
Overall, multiple-choice benchmarks can overstate the usability of compressed LLMs, creating an evaluation blind spot. Compressed models should be tested on what they can produce, not only on what they can recognize.
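The difference between recognizing an answer and producing it can be illustrated with a toy sketch. The code below is not the paper's evaluation code: the scores are invented log-probabilities over whole answer strings, the function names are hypothetical, and ranking whole answers is a simplification of token-level beam search. It only shows how an answer can win among the listed options while losing the overall top spot.

```python
# Toy illustration with made-up scores; not the paper's method or data.

def greedy_answer(scores):
    """Return the single highest-scoring candidate, as greedy decoding would."""
    return max(scores, key=scores.get)

def multiple_choice_answer(scores, options):
    """Return the highest-scoring candidate among the listed options only."""
    return max(options, key=lambda option: scores[option])

def top_k(scores, k):
    """Return the k highest-scoring candidates (a stand-in for a wider search)."""
    return sorted(scores, key=scores.get, reverse=True)[:k]

def classify(scores, options, gold, k=2):
    """Label one question by whether the gold answer is produced or only recognized."""
    if greedy_answer(scores) == gold:
        return "produced"
    if multiple_choice_answer(scores, options) == gold:
        if gold in top_k(scores, k):
            return "recognition-only (demoted)"
        return "recognition-only"
    return "failed"

# Hypothetical log-probabilities for "What is the capital of France?"
options = ["Paris", "Lyon", "Marseille"]
dense = {"Paris": -0.2, "The city": -1.9, "Lyon": -2.5, "Marseille": -3.0}
pruned = {"Paris": -1.4, "The city": -0.9, "Lyon": -2.8, "Marseille": -3.1}

print(classify(dense, options, "Paris"))   # produced
print(classify(pruned, options, "Paris"))  # recognition-only (demoted)
```

In the pruned case the gold answer still outranks every distractor, so multiple-choice scoring counts it as correct, but an unlisted continuation now scores higher, so greedy generation misses it. Widening the search to the top two candidates recovers the answer, which mirrors the abstract's observation that demoted answers often reappear with beam search or sampling.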