A differential dynamic programming (DDP)-based framework for inverse reinforcement learning (IRL) is introduced to recover the parameters of the cost function, system dynamics, and constraints from demonstrations. Unlike existing work, where DDP solves the inner forward problem with inequality constraints, the proposed framework uses DDP to efficiently compute the gradient required by the outer inverse problem with equality and inequality constraints. The equivalence between the proposed method and existing methods based on Pontryagin's Maximum Principle (PMP) is established. Building on this DDP-based IRL with an open-loop loss function, a closed-loop IRL framework is then presented, in which a new loss function is proposed to capture the closed-loop nature of demonstrations; it is shown to outperform the commonly used open-loop loss function. Under certain assumptions, the closed-loop IRL framework reduces to a constrained inverse optimal control problem, and under these assumptions together with a rank condition, it is proven that the learned parameters can be recovered from the demonstration data. The proposed framework is extensively evaluated on four numerical robot examples and a real-world quadrotor system; the experiments validate the theoretical results and demonstrate the practical relevance of the approach.
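The bilevel structure described above (an inner optimal-control solve nested inside an outer gradient-based parameter search) can be illustrated with a minimal sketch. The example below is not the paper's method: it uses a scalar LQR inner problem, for which DDP converges in a single backward/forward pass, so a Riccati recursion stands in for the DDP inner solver, and a central finite-difference gradient stands in for the paper's DDP-based gradient. All constants, names, and the open-loop trajectory-matching loss are illustrative assumptions.

```python
import numpy as np

# Toy bilevel IRL: recover the state-cost weight theta of a scalar LQR
# inner problem from a demonstrated trajectory. All constants are
# illustrative; the Riccati recursion plays the role of the DDP solver.

A, B, T, X0 = 0.9, 0.5, 20, 1.0  # assumed dynamics x' = A x + B u, horizon, start

def inner_solve(theta):
    """Solve min sum theta*x^2 + u^2 s.t. x' = A*x + B*u; return the state trajectory."""
    P, gains = theta, []
    for _ in range(T):  # backward Riccati recursion (DDP backward pass for LQR)
        K = (B * P * A) / (1.0 + B * P * B)
        P = theta + A * P * A - A * P * B * K
        gains.append(K)
    gains.reverse()
    xs = [X0]
    for K in gains:  # forward rollout under the optimal feedback u = -K x
        xs.append(A * xs[-1] + B * (-K * xs[-1]))
    return np.array(xs)

theta_true = 2.0
x_dem = inner_solve(theta_true)  # synthetic "demonstration" from the true cost

def outer_loss(theta):
    # Open-loop trajectory-matching loss between rollout and demonstration.
    return float(np.sum((inner_solve(theta) - x_dem) ** 2))

# Outer inverse problem: sign-gradient descent with a finite-difference
# gradient standing in for the DDP-based gradient of the paper.
theta, eps, step = 0.5, 1e-5, 0.02
for _ in range(200):
    g = (outer_loss(theta + eps) - outer_loss(theta - eps)) / (2 * eps)
    theta -= step * np.sign(g)

print(round(theta, 2))  # recovered weight, close to theta_true = 2.0
```

The same scaffold generalizes in the obvious way: the paper replaces the Riccati solve with constrained DDP on nonlinear dynamics, the finite-difference gradient with an analytic gradient obtained from the DDP recursions, and the open-loop loss with the proposed closed-loop loss.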