Joint Structural Pruning and Mixed-Precision Quantization for LLM Compression
Quick Answer
This paper presents an end-to-end framework for Large Language Models (LLMs) that jointly optimizes mixed-precision quantization and structural pruning. At ultra-low precisions (1-3 bits), its quantization method lowers WikiText perplexity by up to 21% compared with state-of-the-art weight-activation quantization baselines.
Quick Take
The framework pairs a mixed-precision post-training quantization (PTQ) strategy, which minimizes error propagation across the whole model, with a joint search over structural pruning decisions and quantization policies. Against leading weight-only quantization methods, it reports up to 59% and 85% lower perplexity on WikiText and C4, respectively. Compared with state-of-the-art joint pruning-and-quantization techniques, it delivers better perplexity and reasoning performance at ultra-low bit-widths.
Key Points
- Introduces a mixed-precision PTQ strategy minimizing global error propagation.
- Joint optimization learns pruning and quantization policies simultaneously.
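- Achieves up to 21% lower WikiText perplexity than SoTA weight-activation quantization baselines at 1-3 bits.
- Achieves up to 59% lower perplexity on WikiText and 85% lower on C4 than leading weight-only quantization methods.
- Delivers better perplexity and reasoning performance than SoTA joint pruning-and-quantization techniques at ultra-low bit-widths.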
Article Content
From the original arXiv summary (arXiv:2606.07819v1). Abstract: Recently, the efficiency of Large Language Model (LLM) deployment has become a critical concern in practical applications. While post-training quantization (PTQ) and structural pruning are established techniques for reducing memory footprint and inference latency, most existing PTQ approaches optimize quantization errors on a per-layer basis, overlooking how errors accumulate and propagate through the network, often resulting in suboptimal solutions.
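The accumulation argument can be made concrete with a minimal sketch in pure Python. This is not the paper's method: the toy ReLU network, the per-row uniform quantizer `quantize`, and all sizes are assumptions chosen for illustration. The sketch compares the error each quantized layer introduces when it is fed exact inputs (`local`) with the error measured when quantized layers feed one another (`accumulated`).

```python
import random

def quantize(row, bits):
    """Uniform quantization of one weight row to 2**bits evenly spaced levels
    spanning [-max_abs, +max_abs] (a hypothetical toy quantizer)."""
    levels = 2 ** bits - 1
    max_abs = max(abs(v) for v in row) or 1.0
    step = 2 * max_abs / levels
    return [round((v + max_abs) / step) * step - max_abs for v in row]

def matvec(W, x):
    return [sum(w * a for w, a in zip(row, x)) for row in W]

def relu(v):
    return [max(0.0, a) for a in v]

def forward(layers, x):
    """Return the activations after every layer."""
    acts = []
    for W in layers:
        x = relu(matvec(W, x))
        acts.append(x)
    return acts

def rel_error(ref, out):
    """Relative L2 distance between a reference and a perturbed activation."""
    num = sum((p - q) ** 2 for p, q in zip(ref, out)) ** 0.5
    den = sum(p ** 2 for p in ref) ** 0.5 or 1.0
    return num / den

rng = random.Random(0)
dim, depth, bits = 16, 6, 3  # assumed toy sizes
layers = [[[rng.gauss(0, (2 / dim) ** 0.5) for _ in range(dim)]
           for _ in range(dim)] for _ in range(depth)]
q_layers = [[quantize(row, bits) for row in W] for W in layers]
x = [rng.gauss(0, 1) for _ in range(dim)]

ref = forward(layers, x)      # full-precision activations
quant = forward(q_layers, x)  # quantized layers feeding each other

# Per-layer view: each quantized layer receives the exact full-precision input.
exact_inputs = [x] + ref[:-1]
local = [rel_error(r, relu(matvec(Wq, inp)))
         for r, Wq, inp in zip(ref, q_layers, exact_inputs)]

# End-to-end view: errors from earlier layers are carried forward.
accumulated = [rel_error(r, q) for r, q in zip(ref, quant)]
```

Printing `local` and `accumulated` side by side shows how the two measures diverge with depth for a given seed. The first entries are identical, because the first layer sees the exact input in both cases.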
Traditional pipelines also tend to apply pruning and quantization in isolation or sequentially, further compounding sub-optimality. We introduce a novel end-to-end framework that addresses these limitations in two key ways. First, we propose a novel mixed-precision PTQ strategy that directly minimizes global error propagation across the entire model, rather than isolating layer-wise errors.
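As a toy illustration of choosing per-layer bit-widths by end-to-end output error rather than by per-layer error, consider the sketch below. The exhaustive search, the single calibration input, the bit budget, and every function name are assumptions made for illustration; the abstract does not describe the actual optimization procedure.

```python
import itertools
import random

def quantize_row(row, bits):
    """Uniform quantization of one weight row (hypothetical toy quantizer)."""
    levels = 2 ** bits - 1
    max_abs = max(abs(v) for v in row) or 1.0
    step = 2 * max_abs / levels
    return [round((v + max_abs) / step) * step - max_abs for v in row]

def forward(layers, x):
    """Run a small ReLU network and return its final output."""
    for W in layers:
        x = [max(0.0, sum(w * a for w, a in zip(row, x))) for row in W]
    return x

def output_error(layers, bit_plan, x, ref):
    """L2 error at the model output when layer i uses bit_plan[i] bits."""
    q_layers = [[quantize_row(row, b) for row in W]
                for W, b in zip(layers, bit_plan)]
    out = forward(q_layers, x)
    return sum((p - q) ** 2 for p, q in zip(ref, out)) ** 0.5

rng = random.Random(0)
dim, depth = 8, 4  # assumed toy sizes
layers = [[[rng.gauss(0, (2 / dim) ** 0.5) for _ in range(dim)]
           for _ in range(dim)] for _ in range(depth)]
x = [rng.gauss(0, 1) for _ in range(dim)]  # stand-in for calibration data
ref = forward(layers, x)

budget = 2 * depth  # assumed budget: an average of 2 bits per layer
candidates = [plan for plan in itertools.product((1, 2, 3), repeat=depth)
              if sum(plan) <= budget]

# Score every plan by its effect on the final output, not on each layer alone.
best_plan = min(candidates, key=lambda plan: output_error(layers, plan, x, ref))
uniform_plan = (2,) * depth
```

Because `uniform_plan` is itself one of the candidates, `best_plan` can never have a larger output error than uniform 2-bit quantization on this input. Exhaustive search is only feasible at this toy scale; a real model would need a far cheaper search strategy and a proper calibration set.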
Building on this, we develop a novel joint optimization approach that simultaneously learns structural pruning decisions and mixed-precision quantization policies within a unified search space. Extensive experiments show that, at ultra-low precisions (1-3 bits), our quantization method reduces WikiText perplexity by up to 21% compared to state-of-the-art (SoTA) weight-activation quantization baselines.
Against leading weight-only quantization methods, it achieves up to 59% and 85% lower perplexity on WikiText and C4, respectively. Compared to the SoTA joint pruning-and-quantization techniques, our proposed method delivers superior perplexity and reasoning performance at ultra-low bits.
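The idea of a unified search space over pruning and quantization can be sketched as follows. This is a hedged toy example, not the authors' algorithm: the norm-based row pruning in `prune_rows`, the storage cost model in `cost`, the budget, and the brute-force search are all assumptions introduced here. Each layer picks a pair (rows kept, bit-width), and plans are ranked by end-to-end output error.

```python
import itertools
import random

def quantize_row(row, bits):
    """Uniform quantization of one weight row (hypothetical toy quantizer)."""
    levels = 2 ** bits - 1
    max_abs = max(abs(v) for v in row) or 1.0
    step = 2 * max_abs / levels
    return [round((v + max_abs) / step) * step - max_abs for v in row]

def prune_rows(W, keep):
    """Structural pruning stand-in: zero whole output neurons (rows),
    keeping the `keep` rows with the largest L2 norm."""
    order = sorted(range(len(W)), key=lambda i: sum(v * v for v in W[i]),
                   reverse=True)
    kept = set(order[:keep])
    return [row if i in kept else [0.0] * len(row) for i, row in enumerate(W)]

def compress(layers, plan):
    """Apply each layer's (keep, bits) choice; pruned rows stay exactly zero."""
    out = []
    for W, (keep, bits) in zip(layers, plan):
        pruned = prune_rows(W, keep)
        out.append([quantize_row(r, bits) if any(r) else r for r in pruned])
    return out

def forward(layers, x):
    for W in layers:
        x = [max(0.0, sum(w * a for w, a in zip(row, x))) for row in W]
    return x

def error(layers, plan, x, ref):
    out = forward(compress(layers, plan), x)
    return sum((p - q) ** 2 for p, q in zip(ref, out)) ** 0.5

def cost(plan, dim):
    """Assumed cost model: weight bits stored for the rows that are kept."""
    return sum(keep * dim * bits for keep, bits in plan)

rng = random.Random(0)
dim, depth = 8, 3  # assumed toy sizes
layers = [[[rng.gauss(0, (2 / dim) ** 0.5) for _ in range(dim)]
           for _ in range(dim)] for _ in range(depth)]
x = [rng.gauss(0, 1) for _ in range(dim)]  # stand-in for calibration data
ref = forward(layers, x)

# Per-layer options: keep all or half of the rows, at 2, 3, or 4 bits.
options = [(keep, bits) for keep in (dim, dim // 2) for bits in (2, 3, 4)]
baseline = ((dim, 2),) * depth  # dense model, uniform 2-bit
budget = cost(baseline, dim)

candidates = [plan for plan in itertools.product(options, repeat=depth)
              if cost(plan, dim) <= budget]
best_plan = min(candidates, key=lambda plan: error(layers, plan, x, ref))
```

Under this cost model, a half-pruned 4-bit layer costs the same as a dense 2-bit layer, so the search trades sparsity against precision layer by layer. Searching both choices together guarantees a result no worse than the dense 2-bit baseline on this input, which a fixed prune-then-quantize order cannot promise.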