This paper introduces DiffTORI, which uses differentiable trajectory optimization as the policy representation to generate actions for deep reinforcement and imitation learning. Trajectory optimization is a powerful and widely used algorithm in control, parameterized by a cost and a dynamics function. The key to our approach is to leverage recent progress in differentiable trajectory optimization, which enables computing the gradients of the loss with respect to the parameters of trajectory optimization; as a result, the cost and dynamics functions can be learned end-to-end. DiffTORI addresses the ``objective mismatch'' issue of prior model-based RL algorithms: its dynamics model is trained to directly maximize task performance by differentiating the policy gradient loss through the trajectory optimization process. For imitation learning, we further benchmark DiffTORI on standard robotic manipulation task suites with high-dimensional sensory observations, comparing against feed-forward policy classes as well as Energy-Based Models (EBMs) and diffusion policies. Across 15 model-based RL tasks and 35 imitation learning tasks with high-dimensional image and point cloud inputs, DiffTORI outperforms prior state-of-the-art methods in both domains.
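To make the core idea concrete — computing the gradient of an outer loss with respect to the parameters of an inner trajectory optimization — the following is a minimal, hypothetical sketch (not the DiffTORI implementation). It assumes a one-step quadratic cost whose argmin has a closed form, so the gradient through the inner optimizer can be written analytically via the implicit function theorem; the outer loop then updates the cost parameters to minimize an imitation loss against a demonstrated action.

```python
import numpy as np

# Hypothetical setup (illustrative only, not from the paper):
#   inner cost:  C(a; W) = 0.5 * ||a||^2 + (W s)^T a        (identity quadratic term)
#   inner opt:   a* = argmin_a C(a; W) = -W s               (closed form)
#   outer loss:  L  = 0.5 * ||a* - a_expert||^2             (imitation loss)
# Differentiating through the argmin: since da*/dW_ij = -s_j e_i,
#   dL/dW = -(a* - a_expert) s^T
rng = np.random.default_rng(0)
s = rng.normal(size=3)          # state observation
a_expert = rng.normal(size=2)   # demonstrated action
W = rng.normal(size=(2, 3))     # learnable cost parameters

def inner_opt(W, s):
    """Trajectory optimization step: argmin of the quadratic cost."""
    return -W @ s

losses = []
for _ in range(100):
    a_star = inner_opt(W, s)                    # run inner optimization
    losses.append(0.5 * np.sum((a_star - a_expert) ** 2))
    grad_W = -np.outer(a_star - a_expert, s)    # gradient through the argmin
    W -= 0.1 * grad_W                           # end-to-end update of the cost

print(losses[0], "->", losses[-1])
```

In DiffTORI the inner problem has no closed form, so the gradient through the optimizer is obtained with differentiable trajectory optimization machinery rather than a hand-derived expression; the end-to-end structure (outer loss, inner argmin, gradient flowing into cost and dynamics parameters) is the same.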