Offline Semantic Guidance for Efficient Vision-Language-Action Policy Distillation

Billion-parameter Vision-Language-Action (VLA) policies have recently shown impressive performance in robotic manipulation, but their size and inference cost remain major obstacles to real-time closed-loop control. We introduce \textbf{VLA-AD}, a distillation framework that uses a Vision-Language Model (VLM) as an offline semantic supervisor to distill large VLA teachers into lightweight student policies. Rather than relying on low-level action imitation alone, VLA-AD augments the teacher-provided 7-DoF action targets with high-level semantic guidance, namely task phase anchors and multi-frame operating-direction descriptions. These auxiliary signals are used only during training: at test time, the student policy runs on its own and requires neither the VLA teacher nor the VLM. We evaluate VLA-AD on three LIBERO benchmark suites. With OpenVLA-7B as the teacher, our method produces a 158M-parameter student, a $44\times$ reduction in model size, that stays within a $0.27\%$ average relative gap of the teacher. The resulting policy runs at 12.5 Hz on an RTX 4090, a $3.28\times$ inference speedup over OpenVLA-7B. We further show that the same semantic distillation pipeline transfers to a different teacher, $\pi_{0.5}$-4B, where the student outperforms the teacher on two suites and remains within $0.53\%$ on \texttt{libero\_goal}. Additional analysis indicates that phase-level supervision and multi-frame directional cues make the student less sensitive to noisy teacher actions, such as erroneous high-frequency gripper changes. Overall, VLA-AD demonstrates that offline semantic guidance from VLMs can substantially improve the efficiency, robustness, and deployability of VLA policy distillation.
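
As a quick sanity check, the headline efficiency figures can be reproduced from the numbers quoted above. This is only a back-of-the-envelope sketch: it assumes the nominal $7 \times 10^{9}$ parameter count implied by the name OpenVLA-7B, and it infers the teacher's control rate and per-step latencies from the reported student rate and speedup rather than from any separately reported measurement.
\[
\frac{N_{\text{teacher}}}{N_{\text{student}}} \approx \frac{7 \times 10^{9}}{1.58 \times 10^{8}} \approx 44.3,
\qquad
f_{\text{teacher}} \approx \frac{12.5\ \text{Hz}}{3.28} \approx 3.8\ \text{Hz}
\]
\[
t_{\text{student}} = \frac{1}{12.5\ \text{Hz}} = 80\ \text{ms},
\qquad
t_{\text{teacher}} \approx 3.28 \times 80\ \text{ms} \approx 262\ \text{ms}
\]
Under these assumptions, the student's per-step inference time is roughly 80 ms, against roughly a quarter of a second for the teacher.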
