Imitation learning methods seek to learn from an expert either through behavioral cloning (BC) of the policy or inverse reinforcement learning (IRL) of the reward. Such methods enable agents to learn, from human demonstrations, complex tasks that are difficult to capture with hand-designed reward functions. The choice between BC and IRL depends on the quality and state-action coverage of the demonstrations, as well as on any additional access to the Markov decision process. Hybrid strategies that combine BC and IRL are uncommon, because initial policy optimization against inaccurate rewards diminishes the benefit of pretraining the policy with BC. This work derives an imitation method that captures the strengths of both BC and IRL. In the entropy-regularized ('soft') reinforcement learning setting, we show that the behavior-cloned policy can be used as both a shaped reward and a critic hypothesis space by inverting the regularized policy update. This coherence facilitates fine-tuning cloned policies using the reward estimate and additional interactions with the environment. The approach conveniently achieves imitation learning through initial behavioral cloning, followed by refinement via RL with online or offline data sources. The simplicity of the approach enables graceful scaling to high-dimensional and vision-based tasks, with stable learning and minimal hyperparameter tuning, in contrast to adversarial approaches.
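To make "inverting the regularized policy update" concrete, the following is a minimal sketch for a single state with discrete actions. It assumes the standard KL-regularized update, pi(a) ∝ prior(a) · exp(Q(a) / alpha), whose inverse is Q(a) = alpha · log(pi(a) / prior(a)) + V. The abstract does not state these formulas, so the function names, the uniform prior, the temperature `alpha`, and the example probabilities are all illustrative assumptions rather than the authors' implementation.

```python
import numpy as np


def soft_policy(q, prior, alpha):
    """KL-regularized policy update: pi(a) proportional to prior(a) * exp(q(a) / alpha)."""
    logits = np.log(prior) + q / alpha
    logits = logits - logits.max()  # subtract the max for numerical stability
    weights = np.exp(logits)
    return weights / weights.sum()


def soft_value(q, prior, alpha):
    """Soft value: V = alpha * log sum_a prior(a) * exp(q(a) / alpha)."""
    z = np.log(prior) + q / alpha
    m = z.max()
    return alpha * (m + np.log(np.exp(z - m).sum()))


def inverted_critic(policy, prior, alpha, value=0.0):
    """Invert the update: q(a) = alpha * log(policy(a) / prior(a)) + value."""
    return alpha * (np.log(policy) - np.log(prior)) + value


# Hypothetical numbers: four actions, a uniform prior, and a cloned policy.
alpha = 0.5
prior = np.full(4, 0.25)
bc_policy = np.array([0.7, 0.2, 0.05, 0.05])

# Critic (and, under this assumption, shaped reward) implied by the cloned policy.
q_hat = inverted_critic(bc_policy, prior, alpha)

# Applying the regularized update to the inverted critic recovers the cloned policy.
recovered = soft_policy(q_hat, prior, alpha)
```

In this sketch the log-ratio alpha · log(pi / prior) acts as the policy-derived reward signal: the cloned policy is already optimal for it, so subsequent RL fine-tuning would start from a policy and critic that agree with each other instead of optimizing against an unrelated, initially inaccurate reward.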