Although pre-trained Vision-Language-Action (VLA) models exhibit impressive generalization in robotic manipulation, post-training remains crucial for reliable performance during deployment. However, standard offline Supervised Fine-Tuning (SFT) suffers from distribution shift and catastrophic forgetting of pre-trained capabilities, while online Reinforcement Learning (RL) struggles with sparse rewards and poor sample efficiency. In this paper, we propose On-Policy VLA Distillation (VLA-OPD), a framework that combines the efficiency of SFT with the robustness of RL. Instead of relying on sparse environmental rewards, VLA-OPD leverages an expert teacher to provide dense, token-level supervision on the student's self-generated trajectories. This enables active error correction on the states induced by the student's own policy, while gentle alignment preserves pre-trained general capabilities. Crucially, we formulate VLA-OPD with a Reverse-KL objective. Unlike standard Forward-KL, which induces a mode-covering entropy explosion, or Hard-CE (hard cross-entropy), which causes premature entropy collapse, our bounded mode-seeking objective ensures stable policy learning by filtering out the teacher's epistemic uncertainty while maintaining action diversity. Experiments on the LIBERO and RoboTwin2.0 benchmarks demonstrate that VLA-OPD significantly improves sample efficiency over RL and robustness over SFT, while effectively mitigating catastrophic forgetting during post-training.
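To make the three objectives named above concrete, the following is a minimal sketch of how they differ at a single action token. It is not the paper's implementation: the function names, the four-way toy action vocabulary, and the `eps` smoothing term are all assumptions made for illustration, and a real VLA would compute these quantities over its full action-token vocabulary with an autodiff framework.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def forward_kl(teacher, student, eps=1e-12):
    """KL(teacher || student): expectation under the teacher (mode-covering)."""
    return sum(t * (math.log(t + eps) - math.log(s + eps))
               for t, s in zip(teacher, student))


def reverse_kl(teacher, student, eps=1e-12):
    """KL(student || teacher): expectation under the student (mode-seeking)."""
    return sum(s * (math.log(s + eps) - math.log(t + eps))
               for t, s in zip(teacher, student))


def hard_ce(teacher, student, eps=1e-12):
    """Cross-entropy against the teacher's argmax token only."""
    target = max(range(len(teacher)), key=teacher.__getitem__)
    return -math.log(student[target] + eps)


def trajectory_loss(student_logits_seq, teacher_logits_seq, token_loss=reverse_kl):
    """Average a per-token loss over one student-generated trajectory.

    Assumption: both sequences hold logits for the same states, i.e. the
    teacher is queried on the states the student itself visited.
    """
    losses = [token_loss(softmax(t), softmax(s))
              for s, t in zip(student_logits_seq, teacher_logits_seq)]
    return sum(losses) / len(losses)


# Toy example (hypothetical numbers): the teacher is split between two
# actions, and the student commits to one of them.
teacher = [0.49, 0.49, 0.01, 0.01]
student = [0.94, 0.02, 0.02, 0.02]

print("forward KL:", forward_kl(teacher, student))
print("reverse KL:", reverse_kl(teacher, student))
print("hard CE:   ", hard_ce(teacher, student))
```

In this toy case the student that commits to one of the teacher's two modes is penalized less by the reverse KL than by the forward KL, which is the usual sense in which the reverse direction is called mode-seeking. The hard cross-entropy term looks only at the teacher's single top token and ignores the rest of the teacher's distribution.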