This paper introduces DiffTOP, which utilizes Differentiable Trajectory OPtimization as the policy representation to generate actions for deep reinforcement and imitation learning. Trajectory optimization is a powerful and widely used algorithm in control, parameterized by a cost and a dynamics function. The key to our approach is to leverage recent progress in differentiable trajectory optimization, which enables computing the gradients of the loss with respect to the parameters of trajectory optimization. As a result, the cost and dynamics functions of trajectory optimization can be learned end-to-end. DiffTOP addresses the ``objective mismatch'' issue of prior model-based RL algorithms, as the dynamics model in DiffTOP is learned to directly maximize task performance by differentiating the policy gradient loss through the trajectory optimization process. We further benchmark DiffTOP for imitation learning on standard robotic manipulation task suites with high-dimensional sensory observations, comparing our method to feed-forward policy classes as well as Energy-Based Models (EBMs) and diffusion models. Across 15 model-based RL tasks and 35 imitation learning tasks with high-dimensional image and point cloud inputs, DiffTOP outperforms prior state-of-the-art methods in both domains.
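The core mechanism, learning the parameters of a trajectory optimizer by backpropagating an outer task loss through the inner optimization, can be illustrated with a minimal, hypothetical sketch (not the paper's implementation): the inner problem here is a one-dimensional quadratic cost with a learnable parameter `theta`, solved by unrolled gradient descent, and the outer loss is an imitation objective against an expert action. All names and the toy cost are illustrative assumptions.

```python
import torch

# Learnable cost parameter of the inner trajectory-optimization problem.
# Toy cost: c(a) = (a - theta)^2, minimized at a = theta.
theta = torch.tensor([0.0], requires_grad=True)
expert_action = torch.tensor([2.0])  # outer supervision (imitation target)
outer_opt = torch.optim.SGD([theta], lr=0.5)

def trajectory_optimize(theta, steps=20, lr=0.3):
    """Inner loop: unrolled gradient descent on the learned cost.

    The analytic gradient of (a - theta)^2 w.r.t. a is 2*(a - theta);
    each update stays in the autograd graph, so gradients flow from the
    returned action back to theta.
    """
    a = torch.zeros(1)
    for _ in range(steps):
        a = a - lr * 2.0 * (a - theta)
    return a

# Outer loop: differentiate the imitation loss through the inner optimizer,
# updating the cost parameter end-to-end (no objective mismatch: theta is
# trained on the task loss itself, not a surrogate).
for _ in range(50):
    outer_opt.zero_grad()
    action = trajectory_optimize(theta)
    loss = ((action - expert_action) ** 2).mean()
    loss.backward()
    outer_opt.step()
```

After training, `theta` converges near the expert action of 2.0, so the inner optimizer produces the demonstrated behavior. In practice the inner problem involves learned dynamics and high-dimensional observations, and the differentiable solve is done with specialized solvers rather than naive unrolling, but the gradient path is the same.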