NSVQ: Mitigating Codebook Collapse by Stabilizing Encoder Drift in Vector Quantization
Quick Answer
NSVQ mitigates codebook collapse in vector quantization by addressing encoder drift, improving reconstruction quality on ImageNet-1k.
Quick Take
NSVQ targets encoder drift, which the authors identify as a key driver of codebook collapse in large-codebook vector quantization. On ImageNet-1k at 128×128 with 65,536 codes, it reduces rFID from 2.39 (SimVQ) to 2.10, while both methods maintain 100% codebook utilization. Latent diffusion experiments also show improved downstream generation FID.
Key Points
- NSVQ combines a dense non-stationary embedding loss, codebook replacement, and stage-wise encoder freezing.
- The method first helps the codebook track encoder drift during early training, then freezes the encoder to consolidate the codebook under a fixed latent geometry, and finally reintroduces adversarial refinement.
- Experiments show NSVQ maintains full codebook utilization while improving reconstruction quality.
- On ImageNet-1k at 128×128 with 65,536 codes, NSVQ reduces rFID from 2.39 to 2.10 compared with SimVQ.
- Latent diffusion experiments indicate improved generation quality with NSVQ.
Article Content
From source RSS / original summary
arXiv:2606.11363v1
Abstract: Vector quantization is central to modern generative modeling pipelines, but large-codebook VQ models often suffer from codebook collapse. We identify encoder drift as a key driver of this failure: as the encoder moves the latent distribution, sparsely updated code vectors can lag behind, lose assignments, and increase quantization error, creating a feedback loop through the straight-through estimator.
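The lag described above can be illustrated with a toy simulation. This sketch is not the paper's setup: it assumes scalar latents, a four-entry codebook, a constant drift of +1.0 per step, and a mean-based update applied only to codes that won an assignment. All function names are hypothetical. It does not model the encoder or the straight-through estimator, only how sparsely updated codes fall behind a moving latent distribution.

```python
def nearest(z, codebook):
    """Index of the code closest to the scalar latent z."""
    return min(range(len(codebook)), key=lambda k: (z - codebook[k]) ** 2)

def assign(latents, codebook):
    """Map each winning code index to the list of latents assigned to it."""
    groups = {}
    for z in latents:
        groups.setdefault(nearest(z, codebook), []).append(z)
    return groups

def utilization(latents, codebook):
    """Fraction of codes that receive at least one assignment."""
    return len(assign(latents, codebook)) / len(codebook)

def quant_error(latents, codebook):
    """Mean squared distance between each latent and its nearest code."""
    return sum((z - codebook[nearest(z, codebook)]) ** 2 for z in latents) / len(latents)

def sparse_update(latents, codebook, lr=0.5):
    """Move only the codes that won an assignment toward the mean of their latents."""
    for k, zs in assign(latents, codebook).items():
        codebook[k] += lr * (sum(zs) / len(zs) - codebook[k])

codebook = [0.0, 1.0, 2.0, 3.0]
latents = [0.0, 1.0, 2.0, 3.0]
util_before = utilization(latents, codebook)
err_before = quant_error(latents, codebook)

# Assumed drift: the latent distribution shifts by +1.0 on every step.
for _ in range(10):
    latents = [z + 1.0 for z in latents]
    sparse_update(latents, codebook)

util_after = utilization(latents, codebook)
err_after = quant_error(latents, codebook)
print("utilization:", util_before, "->", util_after)
print("quantization error grew:", err_after > err_before)
```

In this toy run, utilization falls from 1.0 to 0.25 and quantization error grows, because three of the four codes stop receiving assignments and therefore stop moving.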
We propose NSVQ, a non-stationary-aware VQ training strategy that combines a dense non-stationary embedding loss, codebook replacement, and stage-wise encoder freezing. NSVQ first helps the codebook track encoder drift during early training, then freezes the encoder to consolidate the codebook under a fixed latent geometry, and finally reintroduces adversarial refinement. Experiments on ImageNet-1k show that NSVQ improves reconstruction quality while maintaining full codebook utilization.
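The three stages and the replacement step can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the step boundaries, the function names, the assumption that the encoder stays frozen in the final stage, and the restart rule (reset unused codes to randomly chosen latents) are all hypothetical, since the abstract does not specify them.

```python
import random

# Hypothetical step boundaries: the abstract does not give the actual schedule.
TRACK_END, CONSOLIDATE_END = 1000, 2000

def stage_for_step(step):
    """Return (stage_name, encoder_trainable) for a training step."""
    if step < TRACK_END:
        return "track_drift", True            # codebook follows a moving encoder
    if step < CONSOLIDATE_END:
        return "consolidate", False           # encoder frozen, latent geometry fixed
    # Assumption: the encoder stays frozen here. The abstract only says that
    # adversarial refinement is reintroduced in the final stage.
    return "adversarial_refinement", False

def replace_dead_codes(codebook, latents, counts, rng):
    """Reset every code with zero assignments to a randomly chosen latent.

    A generic restart heuristic, assumed for illustration; the paper's
    replacement rule may differ.
    """
    replaced = []
    for k, n in enumerate(counts):
        if n == 0:
            codebook[k] = list(rng.choice(latents))
            replaced.append(k)
    return replaced

rng = random.Random(0)
codebook = [[0.0, 0.0], [5.0, 5.0], [-9.0, -9.0]]
latents = [[0.1, 0.0], [4.9, 5.2], [0.2, -0.1]]
counts = [2, 1, 0]                            # code 2 received no assignments
print(stage_for_step(500), stage_for_step(1500), stage_for_step(2500))
replaced = replace_dead_codes(codebook, latents, counts, rng)
print("replaced codes:", replaced)
```

Resetting an unused code to a current latent places it back inside the region the encoder actually occupies, so it can win assignments again.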
On ImageNet-1k at 128×128 with 65,536 codes, NSVQ reduces rFID from 2.39 to 2.10 compared with SimVQ, while both methods maintain 100% utilization. Additional latent diffusion experiments show that NSVQ also improves downstream ImageNet generation FID.
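To put the reported numbers in context, the short calculation below derives the absolute and relative rFID reduction and the number of bits needed to index a 65,536-entry codebook. The input values come from the abstract; the variable names are illustrative.

```python
import math

rfid_simvq, rfid_nsvq = 2.39, 2.10    # values reported in the abstract
codebook_size = 65_536

abs_drop = rfid_simvq - rfid_nsvq
rel_drop = abs_drop / rfid_simvq
bits_per_index = math.log2(codebook_size)

print(f"absolute rFID reduction: {abs_drop:.2f}")
print(f"relative rFID reduction: {rel_drop:.1%}")
print(f"bits per code index: {bits_per_index:.0f}")
```

The drop of 0.29 rFID corresponds to a relative reduction of about 12.1%, and each code index in a 65,536-entry codebook takes 16 bits.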