E-PMQ: Expert-Guided Post-Merge Quantization with Merged-Weight Anchoring

arXiv cs.CL·Wenjun Wang, Yanggan Gu, Shuo Cai, Yuanyi Wang, Pengkai Wang, Jianmin Wu, Hongxia Yang

5/19/2026


Quick Answer

E-PMQ improves post-merge quantization: with 4-bit GPTQ, results rise from 65.0% to 73.6% on eight-task CLIP-ViT-B/32 merging under Task Arithmetic and from 34.8% to 76.7% on 20-task CLIP-ViT-L/14.

Quick Take

E-PMQ is an expert-guided framework for quantizing a merged model. Source expert weights supply output targets during layer-wise calibration, and merged-weight anchoring stabilizes that calibration while preserving the merged model's integrated behavior. The goal is to serve several task- or domain-specialized experts as a single low-bit model under low-resource deployment constraints.

Key Points

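E-PMQ uses source expert weights to provide expert-guided output targets during layer-wise calibration in post-merge quantization.
Merged-weight anchoring stabilizes calibration and preserves the integrated behavior of the merged model.
4-bit GPTQ results improve across settings: 65.0% to 73.6% (Task Arithmetic) and 69.1% to 74.8% (TIES-Merging) on eight-task CLIP-ViT-B/32, 34.8% to 76.7% on 20-task CLIP-ViT-L/14, and 78.26% to 83.34% on FLAN-T5-base GLUE.
The results indicate that merged multi-expert models can be deployed effectively at low bit widths.
The method targets two coupled deviations: the quantization deviation introduced by low-bit reconstruction and the expert-relative merging deviation inherited from model merging.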
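
The two ingredients named above, expert-guided output targets and merged-weight anchoring, can be pictured as two terms of a layer-wise loss. The sketch below is only an illustration under stated assumptions: the abstract does not give the paper's actual objective, so the loss form, the weight `lam`, and the function names `epmq_style_loss` and `continuous_minimizer` are all hypothetical. It also stops at the full-precision minimizer and does not perform the low-bit rounding step.

```python
import numpy as np

def epmq_style_loss(w, calib_inputs, expert_weights, w_merged, lam=1.0):
    """Hypothetical layer-wise loss (an assumption, not the paper's formula).

    output term: match each expert's layer output on that expert's calibration inputs
    anchor term: keep the weights close to the merged weights
    """
    output_term = sum(
        np.linalg.norm(x @ w.T - x @ w_k.T) ** 2
        for x, w_k in zip(calib_inputs, expert_weights)
    )
    anchor_term = lam * np.linalg.norm(w - w_merged) ** 2
    return output_term + anchor_term

def continuous_minimizer(calib_inputs, expert_weights, w_merged, lam=1.0):
    """Closed-form full-precision minimizer of the quadratic loss above."""
    d_in = w_merged.shape[1]
    gram = lam * np.eye(d_in)      # sum_k X_k^T X_k + lam * I
    rhs = lam * w_merged.T         # sum_k X_k^T X_k W_k^T + lam * W_merged^T
    for x, w_k in zip(calib_inputs, expert_weights):
        h = x.T @ x
        gram += h
        rhs += h @ w_k.T
    return np.linalg.solve(gram, rhs).T

# Toy data: three experts fine-tuned from a shared base layer.
rng = np.random.default_rng(0)
d_in, d_out = 16, 8
w_base = rng.normal(size=(d_out, d_in))
experts = [w_base + 0.1 * rng.normal(size=w_base.shape) for _ in range(3)]
inputs = [rng.normal(size=(32, d_in)) for _ in experts]
w_merged = w_base + 0.3 * sum(e - w_base for e in experts)

w_star = continuous_minimizer(inputs, experts, w_merged, lam=1.0)
loss_merged = epmq_style_loss(w_merged, inputs, experts, w_merged)
loss_star = epmq_style_loss(w_star, inputs, experts, w_merged)
print(loss_merged, loss_star)
```

In this toy setup, a larger `lam` pulls the solution back toward the merged weights, while a smaller `lam` lets the expert output targets dominate. How E-PMQ actually balances the two, and how it couples them with GPTQ-style rounding, is described in the paper rather than in this summary.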


[Submitted on 16 May 2026]


Abstract: Low-resource deployment constraints have made model quantization essential for deploying neural networks while preserving performance. Meanwhile, model merging has become an increasingly practical low-resource strategy for integrating multiple task- or domain-specialized experts into a single model without joint training or multi-model serving. Together, quantization and model merging enable an efficient low-resource deployment pipeline by integrating multiple experts into one low-bit model. We formulate this setting as Post-Merge Quantization (PMQ). We show that directly applying post-training quantization (PTQ) to a merged model is unreliable because two distinct deviations are coupled: the quantization deviation introduced by low-bit reconstruction and the expert-relative merging deviation inherited from model merging. To mitigate these deviations, we propose E-PMQ, an expert-guided PMQ framework that uses source expert weights to provide expert-guided output targets during layer-wise calibration, together with merged-weight anchoring to stabilize the calibration and preserve the integrated behavior of the merged model. On CLIP-ViT-B/32 eight-task merging, E-PMQ improves 4-bit GPTQ from 65.0% to 73.6% under Task Arithmetic and from 69.1% to 74.8% under TIES-Merging. On harder settings, E-PMQ improves GPTQ from 34.8% to 76.7% on 20-task CLIP-ViT-L/14 and from 78.26% to 83.34% on FLAN-T5-base GLUE. These results demonstrate that E-PMQ enables effective post-merge quantization and low-bit deployment.
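
To make the two coupled deviations concrete, here is a minimal sketch on a single toy linear layer. It merges experts with Task Arithmetic (base weights plus a scaled sum of task vectors), then quantizes the merged weights to 4 bits. The assumptions are significant: plain round-to-nearest stands in for GPTQ, the scaling coefficient `scale=0.3` and the toy shapes are arbitrary, and the helper names are hypothetical.

```python
import numpy as np

def task_arithmetic_merge(base, experts, scale=0.3):
    """Task Arithmetic: base + scale * sum of task vectors (expert - base)."""
    task_vectors = [w - base for w in experts]
    return base + scale * np.sum(task_vectors, axis=0)

def quantize_rtn(w, bits=4):
    """Symmetric per-row round-to-nearest quantization (a simple stand-in, not GPTQ)."""
    qmax = 2 ** (bits - 1) - 1
    step = np.abs(w).max(axis=1, keepdims=True) / qmax
    step = np.where(step == 0, 1.0, step)  # avoid division by zero on all-zero rows
    return np.clip(np.round(w / step), -qmax - 1, qmax) * step

rng = np.random.default_rng(0)
d_in, d_out, n = 16, 8, 64
base = rng.normal(size=(d_out, d_in))
experts = [base + 0.1 * rng.normal(size=base.shape) for _ in range(3)]
x = rng.normal(size=(n, d_in))  # toy calibration activations

merged = task_arithmetic_merge(base, experts)
merged_q = quantize_rtn(merged, bits=4)

# Quantization deviation: low-bit merged output vs. full-precision merged output.
quant_dev = np.linalg.norm(x @ merged_q.T - x @ merged.T)
# Expert-relative merging deviation: merged output vs. each source expert's output.
merge_devs = [np.linalg.norm(x @ merged.T - x @ e.T) for e in experts]
print(quant_dev, merge_devs)
```

Standard PTQ calibrates only against the first quantity, using the merged model as the reference. The abstract's argument is that the second quantity is already baked into that reference, which is the motivation for bringing the source expert weights back in as calibration targets.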

Subjects:	Computation and Language (cs.CL)
Cite as:	arXiv:2605.16882 [cs.CL]
	(or arXiv:2605.16882v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2605.16882 arXiv-issued DOI via DataCite (pending registration)

Submission history

From: Wenjun Wang
[v1] Sat, 16 May 2026 08:44:36 UTC (206 KB)

— Originally published at arxiv.org


