Vision-Language-Action (VLA) models leveraging the native autoregressive paradigm of Vision-Language Models (VLMs) have demonstrated superior instruction-following and training efficiency. Central to this paradigm is action tokenization, yet its design has primarily focused on reconstruction fidelity, overlooking its direct impact on VLA optimization. Consequently, the fundamental question of \textit{what makes for a good action tokenizer} remains unanswered. In this paper, we bridge this gap by establishing design principles specifically from the perspective of VLA optimization. We identify a set of best practices grounded in information-theoretic insights: maximized temporal token overlap, minimized vocabulary redundancy, enhanced multimodal mutual information, and token independence. Guided by these principles, we introduce \textbf{ActionCodec}, a high-performance action tokenizer that significantly improves both training efficiency and VLA performance across diverse simulation and real-world benchmarks. Notably, on LIBERO, a SmolVLM2-2.2B model fine-tuned with ActionCodec achieves a 95.5\% success rate without any robotics pre-training. With further architectural enhancements, the success rate rises to 97.4\%, a new SOTA among VLA models without robotics pre-training. We believe our design principles, together with the released model, will provide a clear roadmap for the community to develop more effective action tokenizers.