Recent Vision-Language-Action (VLA) models have shown impressive flexibility and generalization, yet their deployment in robotic manipulation remains limited by heavy computational overhead and inference latency. In this work, we present ActDistill, a general action-guided self-derived distillation framework that transfers the action prediction capability of any existing VLA model to a lightweight counterpart. Unlike previous efficiency strategies that primarily emphasize vision-language correlations, ActDistill uses action priors to guide knowledge transfer and model compression, achieving action-oriented efficiency for VLA models. Specifically, we employ a well-trained VLA model as the teacher and introduce a graph-structured encapsulation strategy to explicitly model the hierarchical evolution of action prediction. The student model, derived from the graph-encapsulated teacher, is further equipped with a dynamic router that adaptively selects computation paths based on action prediction demands; the student is guided by hierarchical graph-informed supervision to ensure smooth and efficient evolution. During inference, the graph-related auxiliary components are removed, so the student executes only the dynamically routed layers and predicts high-precision actions with minimal computation and latency. Experiments on embodied benchmarks demonstrate that ActDistill achieves performance comparable or superior to full-scale VLA models while reducing computation by over 50% and delivering up to a 1.67× speedup, thereby establishing a general paradigm toward efficient embodied intelligence.
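The inference-time behavior described in the abstract, where the student runs only the layers its router selects, can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the router architecture, the selection rule, or how the graph-informed supervision trains it. The names `make_layer`, `gate_score`, and `routed_forward`, the linear-plus-sigmoid gate, and the fixed threshold are all hypothetical choices made for illustration.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = List[float]
Layer = Callable[[Vector], Vector]


def make_layer(scale: float) -> Layer:
    """Hypothetical stand-in for a student block: a small residual update."""
    def layer(h: Vector) -> Vector:
        return [v + scale * math.tanh(v) for v in h]
    return layer


def gate_score(h: Vector, weights: Sequence[float], bias: float) -> float:
    """Hypothetical router gate: linear probe on the hidden state, squashed to (0, 1)."""
    logit = sum(w * v for w, v in zip(weights, h)) + bias
    return 1.0 / (1.0 + math.exp(-logit))


def routed_forward(
    h: Vector,
    layers: Sequence[Layer],
    router_params: Sequence[Tuple[Sequence[float], float]],
    threshold: float = 0.5,  # assumed selection rule, not taken from the paper
) -> Tuple[Vector, List[int]]:
    """Run only the layers whose gate score reaches the threshold."""
    executed: List[int] = []
    for idx, (layer, (weights, bias)) in enumerate(zip(layers, router_params)):
        if gate_score(h, weights, bias) >= threshold:
            h = layer(h)
            executed.append(idx)
        # Otherwise the layer is skipped and the hidden state passes through unchanged.
    return h, executed


# Four toy layers; the gates here depend only on the bias, so layers 0 and 2 run.
layers = [make_layer(0.1) for _ in range(4)]
router_params = [([0.0, 0.0], 5.0), ([0.0, 0.0], -5.0),
                 ([0.0, 0.0], 5.0), ([0.0, 0.0], -5.0)]
out, executed = routed_forward([0.5, -0.2], layers, router_params)
print(executed)  # [0, 2]
```

In this toy setup half of the layers are skipped, which mirrors the kind of saving the abstract reports only in spirit; the actual reduction depends on the learned router and the inputs.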