Residual-MPPI: Online Policy Customization for Continuous Control

Policies learned through Reinforcement Learning (RL) and Imitation Learning (IL) have shown significant potential for achieving advanced performance in continuous control tasks. However, in real-world environments, it is often necessary to further customize a trained policy when additional requirements arise that were unforeseen during the original training phase. Fine-tuning the policy can meet the new requirements, but it often requires collecting new data that reflects them, as well as access to the original training metric and policy parameters. In contrast, an online planning algorithm, if capable of meeting the additional requirements, can eliminate the need for extensive training phases and customize the policy without knowledge of the original training scheme or task. In this work, we propose a generic online planning algorithm for customizing continuous-control policies at execution time, which we call Residual-MPPI. It can customize a given prior policy on new performance metrics in few-shot and even zero-shot online settings. Moreover, Residual-MPPI requires access only to the action distribution produced by the prior policy, without additional knowledge of the original task. Our experiments demonstrate that Residual-MPPI accomplishes the few-shot/zero-shot online policy customization task effectively, including customizing the champion-level racing agent Gran Turismo Sophy (GT Sophy) 1.0 in the challenging Gran Turismo Sport (GTS) car racing environment. Demo videos are available on our website: https://sites.google.com/view/residual-mppi
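To make the idea of planning on top of a prior policy's action distribution more concrete, the following is a minimal sketch of a generic MPPI-style planning step. It is an illustration under stated assumptions, not the authors' implementation: the function name `residual_mppi_step`, the Gaussian prior with a fixed standard deviation, the known dynamics model, the `prior_weight` trade-off, and the toy 1D task are all hypothetical choices made for this example. The sketch samples action sequences around the prior policy's mean, scores each rollout by a new add-on reward plus the prior's log-likelihood, and returns the exponentially weighted first action.

```python
import numpy as np


def residual_mppi_step(state, prior_mean, prior_std, dynamics, addon_reward,
                       horizon=10, n_samples=256, temperature=1.0,
                       prior_weight=1.0, rng=None):
    """One receding-horizon planning step (illustrative sketch only).

    Assumptions: the prior policy is Gaussian with mean `prior_mean(states)`
    and a fixed standard deviation `prior_std`; `dynamics` is a known model.
    """
    rng = np.random.default_rng() if rng is None else rng
    # Start every sampled rollout from the same current state.
    states = np.repeat(state[None, :], n_samples, axis=0)
    returns = np.zeros(n_samples)
    first_actions = None
    for t in range(horizon):
        mean = prior_mean(states)                # shape (n_samples, action_dim)
        noise = rng.standard_normal(mean.shape)
        actions = mean + prior_std * noise       # sample from the prior
        # Gaussian log-density of the prior, up to a constant shared by all samples.
        log_prior = -0.5 * np.sum(noise ** 2, axis=1)
        returns += addon_reward(states, actions) + prior_weight * log_prior
        if t == 0:
            first_actions = actions
        states = dynamics(states, actions)
    # Exponential (softmax) weighting of rollouts, as in MPPI.
    weights = np.exp((returns - returns.max()) / temperature)
    weights /= weights.sum()
    return weights @ first_actions               # weighted first action


# Hypothetical toy task: the prior prefers zero action, while the add-on
# reward asks the 1D state to move toward 1.
def toy_prior_mean(states):
    return np.zeros_like(states)


def toy_dynamics(states, actions):
    return states + actions


def toy_addon_reward(states, actions):
    return -np.sum((states + actions - 1.0) ** 2, axis=1)


action = residual_mppi_step(np.array([0.0]), toy_prior_mean, 0.5,
                            toy_dynamics, toy_addon_reward,
                            rng=np.random.default_rng(0))
```

In this toy setup the planner only queries the prior's action distribution (its mean and standard deviation) and never touches the prior's training reward or parameters, which mirrors the access assumption stated in the abstract. The known dynamics model is a simplifying assumption of the sketch.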
