The Benchmark Illusion: Pruned LLMs Can Pass Multiple Choice but Fail to Answer

arXiv cs.CL·Rui Wen, Lu Sun, Jiayang Liu, Zesheng Xu, Tianshuo Cong, Zheng Li

6/17/2026


Quick Answer

This paper shows that pruned large language models (LLMs) can score well on multiple-choice tests yet fail to give the same answers in open-ended generation, a gap the authors call a 'benchmark illusion.' High-sparsity pruning, especially with pruning methods such as Wanda, leads to recognition errors where correct answers are demoted rather than erased.

Quick Take

This discrepancy suggests that compressed models may not be as reliable as benchmarks indicate, necessitating evaluations based on generative capabilities.

Key Points

Pruned models perform well on multiple-choice evaluations but struggle in open generation tasks.
High-sparsity pruning leads to recognition errors, particularly with pruning methods such as Wanda.
Correct answers may be demoted rather than erased in pruned models.
Multiple-choice benchmarks may overstate the effectiveness of compressed models.
Evaluations should focus on generative capabilities, not just recognition.


Source Excerpt

arXiv:2606.17609v1. Abstract: Compressing LLMs reduces memory use and inference cost, but it can also create failures that standard benchmarks miss. A pruned model may still perform well on multiple-choice evaluations, yet fail to answer the same question in open generation. We ask what pruning changes: does it erase the correct answer, or does it make the answer harder to produce as the top output?

We study this question with multilingual question answering, tracking the same questions before and after pruning. We find a benchmark illusion. …
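
To make the "erased versus demoted" question concrete, here is a minimal toy sketch in Python. It is not the paper's method or data: the candidate scores, the option list, the `top_k` cutoff, and the labels "demoted" and "erased" are all hypothetical, chosen only to show how one question can stay correct under multiple-choice scoring while failing under greedy open generation. It assumes the strongest wrong candidate is not among the listed options.

```python
# Toy illustration only: every score, option, and threshold below is made up.

def rank_of(scores, answer):
    """1-based rank of `answer` when candidates are sorted by descending score."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return ordered.index(answer) + 1

def multiple_choice_correct(scores, options, answer):
    """Multiple choice: the answer only has to beat the other listed options."""
    return max(options, key=lambda option: scores[option]) == answer

def open_generation_correct(scores, answer):
    """Greedy open generation: the answer has to be the top candidate overall."""
    return rank_of(scores, answer) == 1

def classify(scores, answer, top_k=3):
    """Hypothetical labels: 'top', 'demoted' (still within top_k), or 'erased'."""
    rank = rank_of(scores, answer)
    if rank == 1:
        return "top"
    return "demoted" if rank <= top_k else "erased"

# Hypothetical log-scores for "What is the capital of Australia?"
dense = {"Canberra": -0.3, "Sydney": -1.6, "Melbourne": -2.9,
         "Perth": -4.0, "Auckland": -4.5}
pruned = {"Canberra": -1.2, "Sydney": -0.9, "Melbourne": -3.1,
          "Perth": -3.5, "Auckland": -4.2}

# Assumption: the distractor list happens to exclude the strong competitor "Sydney".
options = ["Canberra", "Melbourne", "Perth", "Auckland"]

for name, scores in [("dense", dense), ("pruned", pruned)]:
    print(name,
          "| multiple choice:", multiple_choice_correct(scores, options, "Canberra"),
          "| open generation:", open_generation_correct(scores, "Canberra"),
          "| status:", classify(scores, "Canberra"))
```

In this toy setup the pruned scores still pick "Canberra" among the four options, but "Sydney" outranks it in the open candidate set, so greedy generation answers incorrectly even though the correct answer remains at rank 2. Tracking the same question before and after pruning, as the excerpt describes, is what lets such a case be separated from one where the correct answer drops out of the top candidates entirely.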


