This paper introduces DiffTOP, which uses Differentiable Trajectory OPtimization as the policy representation to generate actions for deep reinforcement learning and imitation learning. Trajectory optimization is a powerful and widely used technique in control, parameterized by a cost function and a dynamics function. The key to our approach is to leverage recent progress in differentiable trajectory optimization, which makes it possible to compute the gradients of a loss with respect to the parameters of the trajectory optimization. As a result, the cost and dynamics functions can be learned end-to-end. DiffTOP addresses the "objective mismatch" issue of prior model-based RL algorithms: its dynamics model is trained to directly maximize task performance by differentiating the policy gradient loss through the trajectory optimization process. We further benchmark DiffTOP for imitation learning on standard robotic manipulation task suites with high-dimensional sensory observations, and compare it to feed-forward policy classes as well as Energy-Based Models (EBMs) and diffusion models. Across 15 model-based RL tasks and 13 imitation learning tasks with high-dimensional image and point cloud inputs, DiffTOP outperforms prior state-of-the-art methods in both domains.
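To make the idea of differentiating a loss through trajectory optimization concrete, the following is a minimal sketch on a toy problem. It is not the authors' implementation. Everything in it is an assumption chosen for illustration: the 1-D dynamics `x_{t+1} = x_t + u_t`, the quadratic cost with a single learnable parameter `goal`, the unrolled gradient-descent inner solver, and all names and constants. The policy output is the first optimized action, and an imitation loss on that action is differentiated back to `goal` using forward-mode sensitivities propagated through the solver.

```python
# Toy 1-D problem (all names and constants are illustrative assumptions).
# Dynamics: x_{t+1} = x_t + u_t
# Cost:     sum_t (x_{t+1} - goal)^2 + R * u_t^2
# The learnable parameter is `goal`; the policy returns the first planned action.

H = 3              # planning horizon
R = 0.1            # control penalty
ALPHA = 0.1        # step size of the inner (trajectory) optimizer
INNER_STEPS = 200  # number of unrolled inner iterations


def rollout(x0, u):
    """Return the states x_1..x_H reached from x0 under actions u."""
    xs, x = [], x0
    for ut in u:
        x = x + ut
        xs.append(x)
    return xs


def cost_grad(x0, u, goal):
    """Gradient of the trajectory cost with respect to each action u_t."""
    xs = rollout(x0, u)
    grad = []
    for t in range(len(u)):
        # u_t shifts every later state x_{t+1}..x_H by the same amount.
        tail = sum(2.0 * (xk - goal) for xk in xs[t:])
        grad.append(tail + 2.0 * R * u[t])
    return grad


def plan(x0, goal):
    """Unrolled gradient descent on the actions.

    Also propagates s = d(u)/d(goal) through every solver step. Because
    cost_grad is linear in (x0, u, goal), its total derivative with respect
    to goal is cost_grad(0, s, 1).
    """
    u = [0.0] * H
    s = [0.0] * H
    for _ in range(INNER_STEPS):
        g_u = cost_grad(x0, u, goal)
        g_s = cost_grad(0.0, s, 1.0)
        u = [ui - ALPHA * gi for ui, gi in zip(u, g_u)]
        s = [si - ALPHA * gi for si, gi in zip(s, g_s)]
    return u, s


def train(x0s, expert_actions, goal_init=0.0, lr=0.5, epochs=50):
    """Fit `goal` so that the planner's first action imitates the expert."""
    goal = goal_init
    for _ in range(epochs):
        grad = 0.0
        for x0, a_exp in zip(x0s, expert_actions):
            u, s = plan(x0, goal)
            # Chain rule: d(loss)/d(goal) = d(loss)/d(u_0) * d(u_0)/d(goal)
            grad += 2.0 * (u[0] - a_exp) * s[0]
        goal -= lr * grad / len(x0s)
    return goal


# Expert actions come from the same planner with a hidden "true" goal.
TRUE_GOAL = 1.0
x0s = [-1.0, 0.0, 2.0]
expert_actions = [plan(x0, TRUE_GOAL)[0][0] for x0 in x0s]
learned_goal = train(x0s, expert_actions)
print(f"learned goal: {learned_goal:.4f}")
```

In this sketch the only supervision is the expert's action, yet the gradient reaches the cost parameter because the solver itself is differentiated. A practical system would replace the scalar `goal` with neural cost and dynamics models and would typically rely on an automatic-differentiation library rather than hand-derived sensitivities.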