Imitation learning methods seek to learn from an expert either through behavioral cloning (BC) of the policy or inverse reinforcement learning (IRL) of the reward. Such methods enable agents to learn complex tasks from humans that are difficult to capture with hand-designed reward functions. The choice between BC and IRL depends on the quality and state-action coverage of the demonstrations, as well as on whether additional access to the Markov decision process is available. Hybrid strategies that combine BC and IRL are uncommon, because initial policy optimization against inaccurate rewards diminishes the benefit of pretraining the policy with BC. This work derives an imitation method that captures the strengths of both BC and IRL. In the entropy-regularized ('soft') reinforcement learning setting, we show that the behavior-cloned policy can be used as both a shaped reward and a critic hypothesis space by inverting the regularized policy update. This coherence facilitates fine-tuning cloned policies using the reward estimate and additional interactions with the environment. The approach achieves imitation learning through initial behavioral cloning, followed by refinement via RL with online or offline data sources. In contrast to adversarial approaches, the simplicity of the approach enables graceful scaling to high-dimensional and vision-based tasks, with stable learning and minimal hyperparameter tuning. For the open-source implementation and simulation results, see https://joemwatson.github.io/csil/.
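
To make the phrase "inverting the regularized policy update" concrete, the following is a minimal tabular sketch, not the authors' implementation. It assumes a discrete action space, a uniform prior policy p(a|s), and a temperature `alpha`; the function names `soft_policy` and `inverted_reward` are hypothetical. Under the standard regularized update π(a|s) ∝ p(a|s) exp(Q(s,a)/α), taking logs gives α (log π(a|s) − log p(a|s)) = Q(s,a) − V(s), so a cloned policy's log-likelihood ratio can be read as a reward-like signal.

```python
import numpy as np


def soft_policy(q, log_prior, alpha):
    """Regularized policy update: pi(a|s) proportional to p(a|s) * exp(Q(s,a) / alpha)."""
    logits = log_prior + q / alpha
    # Subtract the row-wise max for numerical stability before exponentiating.
    logits = logits - logits.max(axis=-1, keepdims=True)
    pi = np.exp(logits)
    return pi / pi.sum(axis=-1, keepdims=True)


def inverted_reward(log_pi, log_prior, alpha):
    """Invert the update: alpha * (log pi - log p), which equals Q(s,a) - V(s)."""
    return alpha * (log_pi - log_prior)


# Hypothetical toy problem: 4 states, 3 actions, random Q-values (illustrative only).
rng = np.random.default_rng(0)
n_states, n_actions, alpha = 4, 3, 0.5
q = rng.normal(size=(n_states, n_actions))

# Assumption: a uniform prior policy over actions.
log_prior = np.full((n_states, n_actions), -np.log(n_actions))

# Stand-in for a behavior-cloned policy: here it is exactly the soft-optimal policy.
pi = soft_policy(q, log_prior, alpha)
r_hat = inverted_reward(np.log(pi), log_prior, alpha)

# Soft value function: V(s) = alpha * log sum_a p(a|s) * exp(Q(s,a) / alpha).
v = alpha * np.log(np.sum(np.exp(log_prior + q / alpha), axis=-1, keepdims=True))

# The inverted quantity matches Q(s,a) - V(s) up to floating-point error.
print(np.allclose(r_hat, q - v))
```

In this toy setting the policy is computed from known Q-values, so the identity holds exactly. With a policy fitted to demonstrations, the same expression would only give an estimate, whose quality depends on how well the cloned policy matches the expert and on the choice of prior and temperature.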