Flow-based vision-language-action (VLA) models excel in embodied control but suffer from intractable likelihoods during multi-step sampling, which hinders online reinforcement learning. We propose \textbf{\textit{$\boldsymbol{\pi}$-StepNFT}} (Step-wise Negative-aware Fine-Tuning), a critic-free and likelihood-free framework that requires only a single forward pass per optimization step and eliminates auxiliary value networks. We find that wider exploration spaces necessitate finer-grained, step-wise guidance for alignment. Empirically, $\pi$-StepNFT unlocks the model's latent potential on LIBERO with competitive few-shot robustness. Moreover, it achieves superior generalization on ManiSkill, outperforming value-based baselines in out-of-distribution (OOD) scenarios by preventing overfitting to multimodal features. This property makes it a scalable and promising solution for complex real-world applications.