Human demonstration data is often ambiguous and incomplete, motivating imitation learning approaches that also exhibit reliable planning behavior. A common paradigm for planning-from-demonstration is to learn a reward function via Inverse Reinforcement Learning (IRL) and then deploy this reward via Model Predictive Control (MPC). Towards unifying these methods, we derive a formulation that replaces the policy in IRL with a planning-based agent. Through connections to Adversarial Imitation Learning, this formulation enables end-to-end interactive learning of planners from observation-only demonstrations. Beyond benefits in interpretability, complexity, and safety, we study and observe significant improvements in sample efficiency, out-of-distribution generalization, and robustness. The study includes evaluations on both simulated control benchmarks and real-world navigation experiments using few-to-single observation-only demonstrations.