Likelihood-based policy gradient methods are the dominant approach for training robot control policies from rewards. These methods rely on differentiable action likelihoods, which constrain policy outputs to simple distributions like Gaussians. In this work, we show how flow matching policy gradients -- a recent framework that bypasses likelihood computation -- can be made effective for training and fine-tuning more expressive policies in challenging robot control settings. We introduce an improved objective that enables success in legged locomotion, humanoid motion tracking, and manipulation tasks, as well as robust sim-to-real transfer on two humanoid robots. We then present ablations and analysis of training dynamics. Results show that policies can exploit the flow representation for exploration when training from scratch, and that fine-tuning is more robust than with baselines.