Imitation learning is often used alongside reinforcement learning in environments where reward design is difficult or rewards are sparse, but it is hard to imitate well in unseen states given only a small amount of expert and sampled data. Supervised methods such as Behavioral Cloning require no sampled data but typically suffer from distribution shift. Methods based on reinforcement learning, such as inverse reinforcement learning and Generative Adversarial Imitation Learning (GAIL), can learn from only a small amount of expert data, but they often require many interactions with the environment. Soft Q Imitation Learning (SQIL) addresses these problems and was shown to learn efficiently by combining Behavioral Cloning with soft Q-learning using constant rewards. To make this algorithm more robust to distribution shift, we propose a more efficient and robust algorithm that augments SQIL with a reward function based on adversarial inverse reinforcement learning, which rewards the agent for taking actions in states similar to the demonstrations. We call this algorithm Discriminator Soft Q Imitation Learning (DSQIL). We evaluated it on MuJoCo environments.
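To make the combination concrete, the following is a minimal sketch of how a discriminator-based reward could replace SQIL's zero reward on sampled transitions, assuming a PyTorch implementation. The names `Discriminator`, `discriminator_loss`, and `dsqil_reward` are ours, and the exact reward shaping used in the paper is not specified here; this is an illustration of the general idea, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Discriminator(nn.Module):
    """Scores (state, action) pairs; trained to output high logits on
    expert transitions and low logits on agent transitions."""
    def __init__(self, state_dim, action_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))  # raw logits

def discriminator_loss(disc, expert_s, expert_a, agent_s, agent_a):
    """Standard GAIL/AIRL-style binary cross-entropy objective:
    expert pairs are labeled 1, agent pairs are labeled 0."""
    expert_logits = disc(expert_s, expert_a)
    agent_logits = disc(agent_s, agent_a)
    return (F.binary_cross_entropy_with_logits(
                expert_logits, torch.ones_like(expert_logits))
            + F.binary_cross_entropy_with_logits(
                agent_logits, torch.zeros_like(agent_logits)))

def dsqil_reward(disc, state, action, is_demo):
    """Hypothetical reward: keep SQIL's constant r = 1 on demonstration
    transitions, and reward sampled transitions with D(s, a) in [0, 1],
    the discriminator's estimate that the pair is expert-like.
    (The paper's exact functional form may differ.)"""
    with torch.no_grad():
        d = torch.sigmoid(disc(state, action)).squeeze(-1)
    return torch.where(is_demo, torch.ones_like(d), d)
```

With a reward of this shape, soft Q-learning proceeds exactly as in SQIL: demonstration transitions keep their constant reward, while sampled transitions are rewarded in proportion to how expert-like the discriminator judges them, which is what would give the agent a learning signal in states outside the demonstrations.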