OGPO: Sample Efficient Full-Finetuning of Generative Control Policies

Sarvesh Patil, Mitsuhiko Nakamoto, Manan Agarwal, Shashwat Saxena, Jesse Zhang, Giri Anantharaman, Cleah Winston, Chaoyi Pan, Douglas Chen, Nai-Chieh Huang, Zeynep Temel, Oliver Kroemer, Sergey Levine, Abhishek Gupta, Hongkai Dai, Paarth Shah, Max Simchowitz

Generative control policies (GCPs), such as diffusion- and flow-based control policies, have emerged as effective parameterizations for robot learning. This work introduces Off-policy Generative Policy Optimization (OGPO), a sample-efficient algorithm for finetuning GCPs that maintains off-policy critic networks to maximize data reuse and propagates policy gradients through the full generative process of the policy via a modified PPO objective, using critics as the terminal reward. OGPO achieves state-of-the-art performance on manipulation tasks spanning multi-task settings, high-precision insertion, and dexterous control. To our knowledge, it is also the only method that can finetune poorly initialized behavior cloning policies to near-full task success with no expert data in the online replay buffer, and it does so with little task-specific hyperparameter tuning. Through extensive empirical investigations, we demonstrate that OGPO drastically outperforms alternative methods on policy steering and learning residual corrections, and identify the key mechanisms behind its performance. We further introduce practical stabilization tricks, including success-buffer regularization, two-sided conservative advantages, and Q-variance reduction, to mitigate critic over-exploitation across state- and pixel-based settings. Beyond proposing OGPO, we conduct a systematic empirical study of GCP finetuning, identifying the stabilizing mechanisms and failure modes that govern successful off-policy full-policy improvement.
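To make the phrase "PPO objective, using critics as the terminal reward" more concrete, the following is a minimal sketch and not the authors' implementation. The abstract does not specify how OGPO modifies PPO, so the code uses the standard PPO clipped surrogate. It assumes that each denoising step of the generative policy is treated as one decision with its own log-probability, and that a single scalar advantage derived from a critic's value of the final action is shared by all steps. The names `clipped_surrogate`, `chain_advantage`, `q_final_action`, and `q_baseline` are hypothetical.

```python
import math


def chain_advantage(q_final_action, q_baseline):
    """Hypothetical advantage: critic value of the fully denoised action
    minus a baseline value for the same state (an assumption, not OGPO's rule)."""
    return q_final_action - q_baseline


def clipped_surrogate(new_logps, old_logps, advantage, clip_eps=0.2):
    """Standard PPO clipped surrogate, averaged over denoising steps.

    new_logps / old_logps: per-denoising-step log-probabilities of one sampled
    denoising chain under the current policy and the policy that generated it.
    advantage: one scalar shared by every step of the chain.
    Returns the objective value to be maximized.
    """
    total = 0.0
    for new_lp, old_lp in zip(new_logps, old_logps):
        # Importance ratio between the current and the sampling policy.
        ratio = math.exp(new_lp - old_lp)
        # Restrict the ratio to [1 - clip_eps, 1 + clip_eps].
        clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
        # Pessimistic (lower) of the unclipped and clipped terms.
        total += min(ratio * advantage, clipped * advantage)
    return total / len(new_logps)


# Usage: a three-step denoising chain whose final action the critic
# scores above the baseline, so the shared advantage is positive.
adv = chain_advantage(q_final_action=1.5, q_baseline=1.0)
objective = clipped_surrogate([-0.9, -1.1, -0.4], [-1.0, -1.0, -0.5], adv)
```

Sharing one critic-derived advantage across the whole chain is what lets the gradient reach every denoising step rather than only the last one. In this sketch the clipping bounds how far any single step can move away from the sampling policy in one update.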
