The growing demand for real-time robotic deployment necessitates fast, on-device inference for vision-language-action (VLA) models. Within the VLA literature, efficiency has been studied extensively at the token level, for example through visual token pruning. In contrast, systematic transformer layer reduction has received limited attention and, to the best of our knowledge, has not been explored for flow-based VLA models under knowledge distillation. In this work, we propose Shallow-pi, a principled knowledge distillation framework that aggressively reduces the transformer depth of both the VLM backbone and the flow-based action head, compressing the model from 18 to 6 layers. Shallow-pi achieves more than a 2x inference speedup with less than a one-percent absolute drop in success rate on standard manipulation benchmarks, establishing state-of-the-art performance among reduced-depth VLA models. Crucially, we validate our approach through industrial-scale real-world experiments on Jetson Orin and Jetson Thor across multiple robot platforms, including humanoid systems, in complex and dynamic manipulation scenarios.
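To make the depth-reduction idea concrete, the sketch below illustrates one common way such a distillation objective can be set up: a 6-layer student imitates an 18-layer teacher via a strided layer mapping plus an output-matching term. The layer mapping, loss weighting, and all names here are illustrative assumptions for exposition, not the paper's actual objective.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical shapes: a batch of action tokens with hidden size 64.
B, T, H = 2, 8, 64
TEACHER_LAYERS, STUDENT_LAYERS = 18, 6          # depths from the abstract
STRIDE = TEACHER_LAYERS // STUDENT_LAYERS        # map every 3rd teacher layer

# Random stand-ins for per-layer hidden states; a real pipeline would
# collect these from forward passes of the teacher and student models.
teacher_hiddens = [rng.normal(size=(B, T, H)) for _ in range(TEACHER_LAYERS)]
student_hiddens = [rng.normal(size=(B, T, H)) for _ in range(STUDENT_LAYERS)]

def mse(a, b):
    return float(np.mean((a - b) ** 2))

# Layer-mapped feature distillation: student layer i imitates
# teacher layer (i + 1) * STRIDE - 1 (i.e. layers 2, 5, 8, ..., 17).
feature_loss = float(np.mean([
    mse(student_hiddens[i], teacher_hiddens[(i + 1) * STRIDE - 1])
    for i in range(STUDENT_LAYERS)
]))

# Output-level distillation on the final hidden states feeding the
# flow-based action head.
output_loss = mse(student_hiddens[-1], teacher_hiddens[-1])

alpha = 0.5  # hypothetical weighting between the two terms
kd_loss = alpha * feature_loss + (1 - alpha) * output_loss
print(f"kd_loss = {kd_loss:.4f}")
```

In a training loop, `kd_loss` would be minimized with respect to the student's parameters (e.g. via autodiff in a deep-learning framework); the strided mapping is one simple choice among several studied in the distillation literature.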