Minimizing the Hidden Cost of Scales: Graph-Guided Ultra-Low-Bit Quantization for Large Language Models
Quick Answer
SAGE-PTQ is an ultra-low-bit post-training quantization framework for large language models. It averages 1.03 weight bits and 0.004 scaling bits per matrix and outperforms BiLLM and PB-LLM.
Quick Take
On LLaMA-3-8B, SAGE-PTQ reaches a WikiText2 perplexity of 6.74, compared to 55.8 for BiLLM, while using less than 50% of BiLLM's GPU memory. On LLaMA-2-70B, it decodes 1.5x faster on a single NVIDIA L40 GPU.
Key Points
- SAGE-PTQ minimizes hidden scaling costs in ultra-low-bit quantization for LLMs.
- Achieves 1.03 weight bits and 0.004 scaling bits per matrix on average.
- Reaches a WikiText2 perplexity of 6.74 on LLaMA-3-8B, compared to 55.8 for BiLLM.
- Uses less than 50% of BiLLM's GPU memory on LLaMA-3-8B.
- Delivers 1.5x faster decoding on LLaMA-2-70B with a single NVIDIA L40 GPU.
Article Content
From source RSS / original summary
arXiv:2606.05429v1 Announce Type: new
Abstract: Post-training quantization (PTQ) is critical for the efficient deployment of large language models (LLMs). Recent ultra-low-bit PTQ methods rely on rigid weight-saliency assumptions or position heuristics, introducing substantial hidden scaling overhead. We propose SAGE-PTQ (Saliency-Aware Graph-guided Efficient PTQ), a novel ultra-low-bit quantization framework for LLMs that minimizes hidden scaling cost.
SAGE-PTQ separates salient and unsalient weights using distributional statistics, then models subsampled unsalient weights as a sparse graph to estimate the optimal number of groups per layer. SAGE-PTQ applies dual-mode quantization, assigning multi-bit precision to salient weights and binarizing unsalient weights. To reduce scaling overhead, SAGE-PTQ uses one per-channel scale for salient weights and one scalar per unsalient group.
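The dual-mode idea in this paragraph can be illustrated with a small NumPy sketch. It is not the authors' implementation. Several pieces are assumptions made only for illustration: the function name `dual_mode_quantize`, the saliency rule (magnitude above mean plus `k` standard deviations), the 4-bit setting for salient weights, and the grouping of unsalient weights by magnitude quantiles, which stands in for the graph-guided group estimate that the abstract does not spell out.

```python
import numpy as np

def dual_mode_quantize(W, k=2.0, salient_bits=4, num_groups=2):
    """Toy dual-mode quantizer. Returns the dequantized matrix and the salient mask."""
    W = np.asarray(W, dtype=np.float64)
    abs_w = np.abs(W)
    # Assumption: a weight is salient if |w| > mean(|W|) + k * std(|W|).
    salient = abs_w > abs_w.mean() + k * abs_w.std()
    W_hat = np.zeros_like(W)

    # Salient weights: symmetric multi-bit quantization with one scale per row (channel).
    qmax = 2 ** (salient_bits - 1) - 1
    sal_vals = np.where(salient, W, 0.0)
    row_max = np.abs(sal_vals).max(axis=1, keepdims=True)
    scale = np.where(row_max > 0, row_max / qmax, 1.0)
    q = np.clip(np.round(sal_vals / scale), -qmax, qmax)
    W_hat[salient] = (q * scale)[salient]

    # Unsalient weights: binarize to sign(w) * alpha with one scalar alpha per group.
    unsal = ~salient
    if unsal.any():
        mags = abs_w[unsal]
        # Assumption: groups are magnitude quantile buckets (a stand-in for graph-guided grouping).
        edges = np.quantile(mags, np.linspace(0, 1, num_groups + 1)[1:-1])
        group_id = np.searchsorted(edges, mags, side="right")
        alpha = np.zeros_like(mags)
        for g in range(num_groups):
            in_g = group_id == g
            if in_g.any():
                alpha[in_g] = mags[in_g].mean()  # mean |w| is the usual binarization scale
        W_hat[unsal] = np.sign(W[unsal]) * alpha
    return W_hat, salient
```

Under these assumptions, every unsalient weight is stored as a sign bit plus a group index into a handful of scalars, while the few salient weights carry multi-bit codes and share one scale per row.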
Finally, SAGE-PTQ implements adaptive saliency thresholding to select the optimal saliency ratio per matrix. SAGE-PTQ achieves 1.03 weight bits and only 0.004 scaling bits per matrix on average, outperforming state-of-the-art methods such as BiLLM and PB-LLM. On LLaMA-3-8B, SAGE-PTQ achieves 6.74 WikiText2 perplexity, compared to 55.8 for BiLLM, while using less than 50% of BiLLM's GPU memory. On LLaMA-2-70B, SAGE-PTQ provides 1.5x faster decoding on one NVIDIA L40 GPU, demonstrating practical inference efficiency.
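To make the weight-bit and scaling-bit figures more concrete, here is a rough accounting sketch. It assumes both numbers are averages per weight, that scales are stored in 16 bits, and that the only scales are one per row for salient weights plus one scalar per unsalient group. It ignores any bookkeeping for which weights are salient. The function name `bits_per_weight` and all inputs in the usage note are invented for illustration and are not the paper's reported configuration.

```python
def bits_per_weight(rows, cols, salient_ratio, salient_bits, num_groups, scale_bits=16):
    """Return (average weight bits, average scaling bits) per weight for one matrix."""
    n = rows * cols
    # Salient weights use salient_bits each; unsalient weights use 1 bit each.
    weight_bits = salient_ratio * salient_bits + (1.0 - salient_ratio) * 1.0
    # Assumption: one scale per row (channel) plus one scalar per unsalient group.
    scale_count = rows + num_groups
    scaling_bits = scale_count * scale_bits / n
    return weight_bits, scaling_bits
```

For a hypothetical 4096 x 4096 matrix with 1% salient weights at 4 bits and 8 unsalient groups, `bits_per_weight(4096, 4096, 0.01, 4, 8)` gives about 1.03 weight bits and about 0.0039 scaling bits. The point is only that per-channel and per-group scales amortize over millions of weights, so their cost per weight is tiny. The similarity to the reported averages should not be read as the authors' actual settings.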