Vision-Language-Action Jump-Starting for Reinforcement Learning Robotic Agents

Reinforcement learning (RL) enables high-frequency, closed-loop control for robotic manipulation, but scaling to long-horizon tasks with sparse or imperfect rewards remains difficult due to inefficient exploration and poor credit assignment. Vision-Language-Action (VLA) models leverage large-scale multimodal pretraining to provide generalist, task-level reasoning, but current limitations hinder their direct use in fast and precise manipulation. In this paper, we propose Vision-Language-Action Jump-Starting (VLAJS), a method that combines sparse VLA guidance with on-policy RL to improve exploration and learning efficiency. VLAJS treats VLAs as transient sources of high-level action suggestions that bias early exploration and improve credit assignment, while preserving the high-frequency, state-based control of RL. Our approach augments Proximal Policy Optimization (PPO) with a directional action-consistency regularization term that softly aligns the RL agent's actions with VLA guidance during early training, without enforcing strict imitation, requiring demonstrations, or relying on continuous teacher queries. VLA guidance is applied sparsely and annealed over time, allowing the agent to adapt online and ultimately surpass the guiding policy. We evaluate VLAJS in simulation on six challenging manipulation tasks (lifting, pick-and-place, peg reorientation, peg insertion, poking, and pushing) and validate a subset on a real Franka Panda robot. VLAJS consistently outperforms PPO and distillation-style baselines in sample efficiency, reducing the required environment interactions by over 50% in several tasks. Real-world experiments demonstrate zero-shot sim-to-real transfer and robust execution under clutter, object variation, and external perturbations.
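To make the idea of a directional action-consistency term concrete, the following is a minimal sketch and not the formulation used in VLAJS, which the abstract does not specify. It assumes the regularizer is one minus the cosine similarity between the agent's action and the VLA's suggested action, averaged only over the steps where a suggestion exists (sparse guidance), and that its weight decays linearly to zero (annealing). The function names, the linear schedule, and the parameters `anneal_steps` and `initial_weight` are all hypothetical.

```python
import math


def cosine_similarity(a, b, eps=1e-8):
    """Cosine of the angle between two action vectors (0.0 if either is near zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a < eps or norm_b < eps:
        return 0.0
    return dot / (norm_a * norm_b)


def guidance_weight(step, anneal_steps, initial_weight=1.0):
    """Hypothetical linear anneal: initial_weight at step 0, zero from anneal_steps on."""
    frac = min(max(step / anneal_steps, 0.0), 1.0)
    return initial_weight * (1.0 - frac)


def directional_consistency_loss(agent_actions, vla_actions):
    """Mean of (1 - cosine similarity) over the steps that have a VLA suggestion.

    vla_actions[i] is None wherever the VLA was not queried (sparse guidance).
    Only direction is penalized; action magnitude is left to the RL agent.
    """
    terms = [
        1.0 - cosine_similarity(a, g)
        for a, g in zip(agent_actions, vla_actions)
        if g is not None
    ]
    return sum(terms) / len(terms) if terms else 0.0


def total_loss(ppo_loss, agent_actions, vla_actions, step, anneal_steps):
    """Assumed combination: PPO loss plus the annealed directional regularizer."""
    weight = guidance_weight(step, anneal_steps)
    return ppo_loss + weight * directional_consistency_loss(agent_actions, vla_actions)
```

Under these assumptions, the penalty vanishes once `step` reaches `anneal_steps`, so the objective reduces to plain PPO and the agent is free to depart from the VLA's suggestions. Because the term depends only on the angle between the two actions, it nudges the direction of motion without forcing the agent to copy the guiding policy's exact commands.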
