QVLA: Not All Channels Are Equal in Vision-Language-Action Model's Quantization

The advent of Vision-Language-Action (VLA) models represents a significant leap for embodied intelligence, yet their immense computational demands critically hinder deployment on resource-constrained robotic platforms. Low-bit quantization is a prevalent and preferred technique for compressing large-scale models. However, we find that a systematic analysis of quantization for VLA models is fundamentally lacking. We argue that naively applying uniform-bit quantization from Large Language Models (LLMs) to robotics is flawed, as these methods prioritize passive data fidelity while ignoring how minor action deviations compound into catastrophic task failures. To bridge this gap, we introduce QVLA, the first action-centric quantization framework specifically designed for embodied control. In a sharp departure from the rigid, uniform-bit quantization of LLM-based methods, QVLA introduces a highly granular, channel-wise bit allocation strategy. Its core mechanism is to directly measure the sensitivity of the final action output when each individual channel is quantized to various bit-widths. This process yields a precise, per-channel importance metric that guides a global optimization, which unifies quantization and pruning (0-bit) in a single, cohesive framework. Extensive evaluations on different baselines demonstrate the superiority of our approach. On the LIBERO benchmark, the quantized version of OpenVLA-OFT produced by our method requires only 29.2% of the original model's VRAM while maintaining 98.9% of its original performance and achieving a 1.49x speedup. This translates to a 22.6% performance improvement over the LLM-derived method SmoothQuant. Our work establishes a new, principled foundation for compressing VLA models in robotics, paving the way for deploying powerful, large-scale models on real-world hardware. Code will be released.
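To make the channel-wise idea concrete, the following is a minimal illustrative sketch, not the QVLA implementation. It assumes a toy two-layer policy (`policy`), symmetric uniform quantization (`quantize_channel`), mean squared action deviation as the sensitivity proxy, and a simple greedy allocator (`allocate_bits`) under an average-bit budget. All of these names and choices are hypothetical stand-ins for the action-space sensitivity measurement and global optimization described above.

```python
import numpy as np

def quantize_channel(w, bits):
    """Symmetric uniform quantization of one weight channel; 0 bits prunes it."""
    if bits == 0:
        return np.zeros_like(w)
    if bits < 2:
        raise ValueError("this sketch supports 0 bits or at least 2 bits")
    qmax = 2 ** (bits - 1) - 1
    max_abs = np.max(np.abs(w))
    if max_abs == 0:
        return w.copy()
    scale = max_abs / qmax
    return np.clip(np.round(w / scale), -qmax, qmax) * scale

def policy(x, W, H):
    """Toy stand-in for a VLA model: observation features -> action vector."""
    return np.tanh(x @ W.T) @ H.T

def sensitivity_table(x, W, H, bit_choices):
    """table[c, j]: mean squared action deviation when only channel c of W
    is quantized to bit_choices[j] (assumed proxy for action sensitivity)."""
    ref = policy(x, W, H)
    table = np.zeros((W.shape[0], len(bit_choices)))
    for c in range(W.shape[0]):
        for j, b in enumerate(bit_choices):
            Wq = W.copy()
            Wq[c] = quantize_channel(W[c], b)
            table[c, j] = np.mean((policy(x, Wq, H) - ref) ** 2)
    return table

def allocate_bits(table, bit_choices, avg_bit_budget):
    """Greedy allocation (an assumption, not the paper's optimizer): start at
    the highest bit-width and repeatedly lower the channel whose next step
    down adds the least action error per bit saved, until the budget is met.
    bit_choices must be sorted in ascending order."""
    bits = np.array(bit_choices)
    level = np.full(table.shape[0], len(bit_choices) - 1)
    while bits[level].mean() > avg_bit_budget:
        best_c, best_cost = -1, np.inf
        for c in range(table.shape[0]):
            if level[c] == 0:
                continue  # already at the lowest choice
            saved = bits[level[c]] - bits[level[c] - 1]
            cost = (table[c, level[c] - 1] - table[c, level[c]]) / saved
            if cost < best_cost:
                best_c, best_cost = c, cost
        if best_c < 0:
            break  # nothing left to lower
        level[best_c] -= 1
    return bits[level]

# Toy data: 8 channels whose influence on the action grows with the index.
rng = np.random.default_rng(0)
x = rng.normal(size=(64, 16))                        # 64 observations
W = rng.normal(size=(8, 16))                         # 8 output channels
H = rng.normal(size=(7, 8)) * np.linspace(0.01, 1.0, 8)  # 7-dim action head

bit_choices = (0, 2, 4, 8)                           # 0 bits means pruning
table = sensitivity_table(x, W, H, bit_choices)
alloc = allocate_bits(table, bit_choices, avg_bit_budget=4.0)
print("per-channel bits:", alloc, "average:", alloc.mean())
```

Because 0 is one of the candidate bit-widths, pruning a channel is just another allocation outcome, which mirrors how a single search can cover both quantization and pruning. A real system would measure sensitivity on the full model's action output rather than on a toy layer.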
