ActQuant: Sub-4-bit Action-Guided Quantization for Vision-Language-Action Models
Quick Take
ActQuant is an action-guided post-training quantization framework for Vision-Language-Action models that targets the sub-4-bit regime. On the LIBERO benchmark it operates at or below 3 bits-per-weight while retaining 95.0% on OpenVLA-OFT. At 2.5 bits-per-weight it compresses the OpenVLA-OFT backbone from 14.3 GB to 2.7 GB, and on a physical UR3 arm the quantized $\pi_{0.5}$ retains the baseline's success rate.
Key Points
- ActQuant is a two-stage mixed-precision PTQ framework: an inter-tensor bit allocator followed by an intra-tensor scale optimizer.
- At 2.5 bits-per-weight, it retains 90.1% on OpenVLA-OFT.
- OmniModel.cpp ports models into a native C/C++ runtime with low-bit kernels for on-device deployment.
- On the physical UR3 arm, $\pi_{0.5}$ quantized with ActQuant keeps the baseline's success rate with a 2.5 times smaller memory footprint.
- On the LIBERO benchmark, ActQuant is the only method that operates at or below 3 bits-per-weight.
Article Content
From source RSS / original summary: arXiv:2605.24011v1, Announce Type: new.
Abstract: Vision-Language-Action (VLA) models exhibit remarkable action generation for embodied intelligence, but their heavy compute requirements make deployment on edge platforms impractical. Aggressive, sub-4-bit weight quantization is the natural solution, yet existing post-training quantization (PTQ) methods suffer severe performance degradation in this regime.
To address this, we introduce ActQuant, an action-guided mixed-precision PTQ framework that operates in two stages: (1) an inter-tensor bit allocator that assigns each weight matrix a single bit-width based on how much it contributes to predicting the agent's actions; (2) an intra-tensor scale optimizer that tunes per-block quantization scales using action-aware curvature, so that dynamic range is concentrated on the weights most influential for control.
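The two stages can be illustrated with a minimal Python sketch. This is not the authors' algorithm: the abstract does not say how the allocator or the scale optimizer work internally, so the greedy budgeted allocation, the grid search over scales, and the names allocate_bits, tune_scale, sensitivity, and importance are all assumptions made for illustration. Here sensitivity stands in for a per-tensor "contribution to action prediction" score and importance stands in for per-weight action-aware curvature.

```python
from typing import Dict, List, Sequence, Tuple


def allocate_bits(sizes: Dict[str, int], sensitivity: Dict[str, float],
                  target_bpw: float, choices: Sequence[int] = (2, 3, 4)) -> Dict[str, int]:
    """Stage 1 (hypothetical): give each tensor one bit-width under an average budget.

    Every tensor starts at the lowest bit-width; tensors are then upgraded in
    order of sensitivity per weight while the total bit budget allows it.
    """
    levels = sorted(choices)
    bits = {name: levels[0] for name in sizes}
    budget = target_bpw * sum(sizes.values())
    used = sum(bits[n] * sizes[n] for n in sizes)
    order = sorted(sizes, key=lambda n: sensitivity[n] / sizes[n], reverse=True)
    for level in levels[1:]:
        for name in order:
            extra = (level - bits[name]) * sizes[name]
            if bits[name] < level and used + extra <= budget:
                bits[name] = level
                used += extra
    return bits


def quantize_block(block: List[float], scale: float, bits: int) -> List[float]:
    """Symmetric round-to-nearest quantization of one block, then dequantization."""
    qmax = 2 ** (bits - 1) - 1
    qmin = -qmax - 1
    return [max(qmin, min(qmax, round(w / scale))) * scale for w in block]


def weighted_error(block: List[float], importance: List[float],
                   scale: float, bits: int) -> float:
    """Importance-weighted squared reconstruction error for a given scale."""
    deq = quantize_block(block, scale, bits)
    return sum(h * (w - q) ** 2 for h, w, q in zip(importance, block, deq))


def tune_scale(block: List[float], importance: List[float],
               bits: int, steps: int = 50) -> Tuple[float, float]:
    """Stage 2 (hypothetical): grid-search a per-block scale.

    Candidates are fractions of the max-abs scale, so a smaller scale can trade
    clipping of unimportant outliers for finer resolution on important weights.
    """
    qmax = 2 ** (bits - 1) - 1
    base = max(abs(w) for w in block) / qmax or 1.0  # fall back for all-zero blocks
    best_scale, best_err = base, float("inf")
    for i in range(1, steps + 1):
        scale = base * (i / steps)
        err = weighted_error(block, importance, scale, bits)
        if err < best_err:
            best_scale, best_err = scale, err
    return best_scale, best_err


# Toy usage with made-up tensor names, sizes, and scores.
sizes = {"a": 100, "b": 100, "c": 200}
sensitivity = {"a": 9.0, "b": 1.0, "c": 4.0}
bit_plan = allocate_bits(sizes, sensitivity, target_bpw=2.75)

block = [0.1, -0.2, 0.05, 1.0]
importance = [10.0, 10.0, 10.0, 0.1]  # the large outlier matters least
scale, err = tune_scale(block, importance, bits=3)
```

In the toy run, the least sensitive tensor stays at 2 bits so the average meets the 2.75-bit budget, and the tuned scale is never worse than the plain max-abs scale under the weighted error, because the max-abs scale is one of the candidates.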
To deliver the on-device benefits of our aggressive quantization, we further introduce OmniModel.cpp, an agentic conversion pipeline that ports architectures into a native C/C++ runtime with efficient low-bit kernels. We evaluate ActQuant both in simulation and on a real-world 6-DoF UR3 arm, with all models deployed through OmniModel.cpp. On the LIBERO benchmark, ActQuant is the only method that operates at or below 3 bits-per-weight, retaining 95.0% on OpenVLA-OFT and 94.8% on $\pi_{0.5}$.
Pushed further, ActQuant reaches 2.5 bpw at 90.1% on OpenVLA-OFT, compressing the backbone from 14.3 GB to 2.7 GB (5.3$\times$). On the physical UR3 arm, $\pi_{0.5}$ quantized with ActQuant retains the baseline's success rate while reducing the memory footprint by 2.5$\times$.
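The reported compression figure can be checked with a few lines of Python. The ratio uses only the two sizes quoted above. The second helper, effective_bits, is a hypothetical back-of-the-envelope estimate that assumes the uncompressed backbone is stored at 16 bits per weight, which the abstract does not state.

```python
def compression_ratio(original_gb: float, compressed_gb: float) -> float:
    """Size reduction factor between the original and quantized backbone."""
    return original_gb / compressed_gb


def effective_bits(original_gb: float, compressed_gb: float,
                   baseline_bits: float = 16.0) -> float:
    """Average stored bits per weight implied by the two sizes.

    ASSUMPTION: the baseline uses `baseline_bits` bits per weight (16 here);
    the source does not give the baseline precision.
    """
    return baseline_bits * compressed_gb / original_gb


ratio = compression_ratio(14.3, 2.7)
print(f"{ratio:.1f}x")  # prints "5.3x"
print(f"{effective_bits(14.3, 2.7):.2f} bits per weight under the 16-bit assumption")
```

If the 16-bit assumption holds, the implied average is somewhat above the nominal 2.5 bpw, which would be consistent with extra storage such as per-block scales or components kept at higher precision. The abstract does not break the 2.7 GB down, so this is only a plausibility check.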
