Vision-Language-Action (VLA) models have emerged as a powerful paradigm for robotic manipulation. However, existing post-training methods face a dilemma between stability and exploration: Supervised Fine-Tuning (SFT) is constrained by demonstration quality and lacks generalization, whereas Reinforcement Learning (RL) improves exploration but often induces erratic, jittery trajectories that violate physical constraints. To bridge this gap, we propose SmoothVLA, a novel reinforcement learning fine-tuning framework that synergistically optimizes task performance and motion smoothness. Its technical core is a physics-informed hybrid reward function that integrates a binary sparse task reward with a continuous dense term derived from trajectory jerk. Crucially, this reward is intrinsic: it is computed directly from policy rollouts, requiring no extrinsic environment feedback or laborious reward engineering. Leveraging Group Relative Policy Optimization (GRPO), SmoothVLA establishes trajectory smoothness as an explicit optimization prior, guiding the model toward physically feasible and stable control. Extensive experiments on the LIBERO benchmark demonstrate that SmoothVLA outperforms standard RL by 13.8\% in smoothness and significantly surpasses SFT in generalization across diverse tasks. Our work offers a scalable approach to aligning VLA models with physical-world constraints through intrinsic reward optimization.
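As a concrete illustration of the hybrid reward described above, the minimal sketch below shows one plausible way to combine a binary sparse task reward with a dense jerk-based smoothness term computed directly from a rollout. The function names, the weighting coefficient \texttt{lambda\_smooth}, and the finite-difference jerk estimate are our own illustrative assumptions, not the paper's implementation.

\begin{verbatim}
import numpy as np

def jerk_smoothness_reward(positions: np.ndarray, dt: float) -> float:
    """Dense smoothness term from trajectory jerk (third time
    derivative of position).

    positions: (T, D) array of end-effector positions from a rollout.
    Returns a non-positive penalty; smoother trajectories score
    closer to 0.
    """
    # Finite-difference jerk: third discrete derivative over time.
    jerk = np.diff(positions, n=3, axis=0) / dt**3
    return -float(np.mean(np.linalg.norm(jerk, axis=-1)))

def hybrid_reward(positions: np.ndarray, task_success: bool,
                  dt: float = 0.05, lambda_smooth: float = 0.1) -> float:
    """Sparse binary task reward plus a dense, intrinsic jerk penalty.

    Both terms are computed from the rollout itself, so the
    smoothness signal needs no extra environment instrumentation.
    """
    r_task = 1.0 if task_success else 0.0
    return r_task + lambda_smooth * jerk_smoothness_reward(positions, dt)
\end{verbatim}

In a GRPO-style update, a scalar reward of this form would be computed for each rollout in a sampled group, with advantages taken relative to the group's mean reward rather than a learned value baseline.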