Potential-Guided Flow Matching for Vision-Language-Action Policy Improvement

Yunpeng Mei, Jiakai He, Hongjie Cao, Chenyu Wang, Xiaowen Zhu, Yihan Zhou, Jiamin Wang, Chenbo Xin, Peng Cheng, Yuxuan Yang, Yijie Wang, Xinhu Zheng, Gao Huang, Jie Chen, Gang Wang

Large vision-language-action (VLA) policies are increasingly trained as conditional generative models over action chunks. Yet deployment produces mixed-quality experience (successful demonstrations, partial completions, recoverable mistakes, and failures) that is difficult to use with standard imitation. Full behavior cloning (BC) imitates failures, filtered BC discards useful sub-trajectories, and offline reinforcement learning adds a large critic. We introduce ForesightFlow, a self-guided flow-matching policy that augments each generated action chunk with a learned success-potential trajectory. The same flow proposes and scores candidate actions, enabling best-of-$K$ inference without an external critic. The key issue is that policy improvement and value calibration require different supervision: advantage weighting should emphasize high-quality actions, but applying the same weights to potential coordinates suppresses failure gradients and creates overconfident scores. We address this with decoupled advantage-weighted flow matching, applying exponentiated advantage weights only to action velocities while training potential velocities uniformly. We further derive a one-step boundary estimator for conditional flow matching, allowing advantage computation with a single stop-gradient forward pass. Across five BEHAVIOR-1K simulation tasks and five real-world bimanual tasks, ForesightFlow improves over imitation baselines, matches the strongest separate-critic baseline in simulation success, improves real-world success, and reduces training compute by $38\%$. Ablations show that decoupling prevents value hallucination, the one-step estimator preserves candidate-ranking fidelity, and self-guided sampling improves long-horizon execution.
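
The decoupled weighting and the best-of-$K$ selection described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the temperature `beta`, the weight clip `w_max`, and the rule of ranking candidates by their final predicted potential are all assumptions introduced here for illustration. The per-sample errors are assumed to be precomputed squared errors between predicted and target flow velocities.

```python
import math

def decoupled_awfm_loss(action_err, potential_err, advantages,
                        beta=1.0, w_max=20.0):
    """Hypothetical decoupled advantage-weighted flow-matching loss.

    action_err[i]    : squared velocity error on the action coordinates
    potential_err[i] : squared velocity error on the potential coordinates
    advantages[i]    : advantage estimate, treated as a constant
    beta, w_max      : assumed temperature and weight clip (not from the paper)
    """
    n = len(advantages)
    # Exponentiated advantage weights, clipped for numerical stability.
    weights = [min(math.exp(a / beta), w_max) for a in advantages]
    # Action velocities: weighted, so high-advantage samples dominate.
    action_loss = sum(w * e for w, e in zip(weights, action_err)) / n
    # Potential velocities: uniform, so failures still supply gradient.
    potential_loss = sum(potential_err) / n
    return action_loss + potential_loss

def best_of_k(candidates):
    """Return the action chunk whose potential trajectory ends highest.

    candidates: list of (action_chunk, potential_trajectory) pairs.
    Ranking by the final potential value is an assumed scoring rule.
    """
    return max(candidates, key=lambda c: c[1][-1])[0]
```

In this sketch, a failed sample with a strongly negative advantage contributes almost nothing to the action term yet keeps its full weight in the potential term, which is the behavior the abstract attributes to decoupling.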
