Vision-Language-Action (VLA) models have demonstrated strong capabilities in complex robotic manipulation. However, their large parameter counts and high inference latency hinder real-world deployment, especially on resource-constrained platforms. To address this, we conduct a systematic empirical study of model compression for VLAs. Building on the insights from this study, we present \textit{RLRC}, a three-stage compression and recovery pipeline consisting of structured pruning, performance recovery via supervised fine-tuning (SFT) and reinforcement learning (RL), and subsequent quantization. The RL stage incorporates a critic warm-up strategy and behavior cloning (BC) loss regularization to stabilize training and preserve policy behavior. RLRC achieves up to an 8-fold memory reduction and a 2.3-fold inference speedup while maintaining the original task success rate. Extensive experiments across multiple VLA backbones show that RLRC consistently outperforms existing compression baselines, highlighting its effectiveness for on-device deployment. Project website: https://rlrc-vla.github.io
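As a rough illustration of the pipeline described in the abstract, the following toy sketch applies row-wise structured pruning and then symmetric integer quantization to a small weight matrix, and shows a BC-regularized objective as a weighted sum. It is a minimal sketch under stated assumptions, not the RLRC implementation. The function names, the norm-based pruning criterion, the 50% keep ratio, the 16-bit and 4-bit widths, and the additive form of the loss are all hypothetical choices made so the arithmetic is easy to follow; the abstract does not specify any of them.

```python
from math import sqrt


def prune_rows(weight, keep_ratio):
    """Structured pruning (assumed criterion): drop whole rows with the smallest L2 norm."""
    n_keep = max(1, round(len(weight) * keep_ratio))
    norms = [sqrt(sum(w * w for w in row)) for row in weight]
    # Pick the rows with the largest norms, then restore their original order.
    ranked = sorted(range(len(weight)), key=lambda i: norms[i], reverse=True)
    keep = sorted(ranked[:n_keep])
    return [weight[i] for i in keep]


def quantize_symmetric(weight, bits):
    """Per-tensor symmetric quantization to signed integers in [-qmax, qmax]."""
    qmax = 2 ** (bits - 1) - 1
    max_abs = max(abs(w) for row in weight for w in row)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    q = [[max(-qmax, min(qmax, round(w / scale))) for w in row] for row in weight]
    return q, scale


def memory_bits(weight, bits_per_param):
    """Storage for the weights alone, ignoring scales and other metadata."""
    return sum(len(row) for row in weight) * bits_per_param


def bc_regularized_loss(rl_loss, bc_loss, bc_weight):
    """Assumed form of BC regularization: RL loss plus a weighted BC term."""
    return rl_loss + bc_weight * bc_loss


# Toy 4x2 weight matrix stored at an assumed 16 bits per parameter.
dense = [[0.1, -0.2], [2.0, 0.9], [0.0, 0.05], [-1.5, 0.5]]

pruned = prune_rows(dense, keep_ratio=0.5)          # stage 1: structured pruning
# Stage 2 (SFT and RL recovery) would retrain `pruned` here; omitted in this toy.
quantized, scale = quantize_symmetric(pruned, bits=4)  # stage 3: quantization

ratio = memory_bits(dense, 16) / memory_bits(quantized, 4)
print(pruned, quantized, ratio)
```

In this toy setting, halving the number of rows and moving from 16-bit to 4-bit storage multiply together, which is one way pruning and quantization can compound into a large memory reduction. Whether the reported figures arise from this particular combination is not stated in the abstract.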