We propose Q-Guided Value-Gradient Matching (Q-VGM), an off-policy reinforcement learning (RL) method that tackles a long-standing challenge in fine-tuning flow-matching vision-language-action (VLA) policies: efficiently improving an expressive flow-matching action expert with respect to a learned Q-function. Effective improvement must exploit the critic's first-order (gradient) information, but this is difficult for flow policies: back-propagating the value directly through their multi-step denoising process is numerically unstable at VLA scale, and the tractable action likelihoods required by policy-gradient methods are unavailable under iterative denoising. Existing value-based methods backpropagate through the full denoising chain, use the critic only at test time without updating the policy, or distill critic-improved actions as terminal labels without supervising the velocity field. Q-VGM sidesteps these issues by building on VGG-Flow, a value-gradient view of flow alignment in generative modeling that turns the value gradient into a value-gradient field defined over denoising time rather than an unstable end-to-end objective. As a result, Q-VGM requires no action likelihoods and no backpropagation through the denoising chain, and it operates on a fixed replay buffer. The critic is an action-sensitive Cal-QL ensemble over compact RLT features with per-layer action injection. Q-VGM enables a practical paradigm of few-shot initialization followed by learning from experience: starting from a pi0.5 VLA initialized by few-shot supervised fine-tuning (SFT), the method uses self-generated rollout data to substantially improve task performance without additional expert supervision. On LIBERO, Q-VGM raises the average success rate from 75.0% to 92.5%; on RoboTwin 2.0, from 76.4% to 87.2%; and on two real-robot tabletop tasks, from 40.0% to 67.5%, outperforming all same-backbone, same-critic baselines across all three settings.
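The idea of supervising the velocity field with critic gradients, without backpropagating through the denoising chain, can be illustrated with a minimal toy sketch. This is not the Q-VGM objective, which the abstract does not spell out. It assumes a linear noise-to-action interpolation, a quadratic stand-in critic with an analytic action gradient, and a hypothetical guidance scale `eta`. All function names are illustrative.

```python
import numpy as np

def q_action_grad(action, goal):
    """Action-gradient of a toy critic Q(a) = -||a - goal||^2 (a stand-in
    for a learned Q-function): dQ/da = -2 (a - goal)."""
    return -2.0 * (action - goal)

def guided_velocity_target(a_t, t, ref_velocity, goal, eta=0.1):
    """Hypothetical regression target for the velocity at denoising time t.

    The critic gradient is evaluated once, at a one-step estimate of the
    terminal action, and added to the reference velocity as a constant.
    Nothing is differentiated through a multi-step denoising rollout.
    """
    a1_hat = a_t + (1.0 - t) * ref_velocity  # one-step terminal estimate
    return ref_velocity + eta * q_action_grad(a1_hat, goal)

def matching_loss(pred_velocity, target_velocity):
    """Mean squared error between predicted and target velocities."""
    return np.mean(np.sum((pred_velocity - target_velocity) ** 2, axis=-1))

# Toy batch standing in for a fixed replay buffer (assumed shapes: [batch, dim]).
rng = np.random.default_rng(0)
actions = rng.normal(size=(4, 2))       # buffer actions
noise = rng.normal(size=(4, 2))         # denoising start points
t = rng.uniform(size=(4, 1))            # denoising times in [0, 1)
goal = np.zeros(2)                      # maximizer of the toy critic

a_t = (1.0 - t) * noise + t * actions   # assumed linear interpolation
ref_velocity = actions - noise          # velocity of that interpolation
target = guided_velocity_target(a_t, t, ref_velocity, goal)

# A velocity model would be regressed onto `target`; here the unguided
# reference velocity plays the role of the current prediction.
loss = matching_loss(ref_velocity, target)
```

In this toy setting the target differs from the reference velocity only by a scaled critic gradient, so the regression needs one critic gradient evaluation per sample and no action likelihoods.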