Learning a reward function from demonstrations suffers from low sample efficiency. Even with abundant data, current inverse reinforcement learning methods that learn from a single environment can fail to handle slight changes in the environment dynamics. We tackle these challenges through adaptive environment design. In our framework, the learner repeatedly interacts with the expert: in each round, the learner selects an environment so that the expert's demonstrations in it identify the reward function as quickly as possible. As we show experimentally, this improves both sample efficiency and robustness, under both exact and approximate inference.
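The interaction loop described above can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: environments are one-step choice problems given as action-feature matrices, the expert is Boltzmann-rational with a hypothetical temperature `beta`, the reward is one of a finite set of candidate weight vectors `thetas`, and the learner picks the environment with the highest expected information gain about the reward. The paper's actual selection criterion and inference procedure may differ.

```python
import numpy as np

def expert_policy(env, theta, beta=5.0):
    """Boltzmann-rational action distribution (an assumed expert model).
    env: (n_actions, n_features) feature matrix; theta: reward weights."""
    logits = beta * (env @ theta)
    logits = logits - logits.max()  # subtract max for numerical stability
    p = np.exp(logits)
    return p / p.sum()

def entropy(p):
    """Shannon entropy in nats, ignoring zero-probability entries."""
    p = p[p > 0]
    return -np.sum(p * np.log(p))

def expected_information_gain(env, thetas, posterior):
    """Mutual information between the reward hypothesis and the expert's action."""
    # likelihoods[k, a] = P(action a | hypothesis k, env)
    likelihoods = np.array([expert_policy(env, th) for th in thetas])
    marginal = posterior @ likelihoods
    conditional = sum(posterior[k] * entropy(likelihoods[k])
                      for k in range(len(thetas)))
    return entropy(marginal) - conditional

def select_environment(envs, thetas, posterior):
    """Index of the candidate environment with the largest expected gain."""
    gains = [expected_information_gain(e, thetas, posterior) for e in envs]
    return int(np.argmax(gains))

def update_posterior(posterior, env, action, thetas):
    """Exact Bayes update over the finite hypothesis set."""
    lik = np.array([expert_policy(env, th)[action] for th in thetas])
    new = posterior * lik
    return new / new.sum()

def run(envs, thetas, true_idx, rounds=10, seed=0):
    """Repeat: select an environment, observe one expert action, update."""
    rng = np.random.default_rng(seed)
    posterior = np.full(len(thetas), 1.0 / len(thetas))
    for _ in range(rounds):
        i = select_environment(envs, thetas, posterior)
        probs = expert_policy(envs[i], thetas[true_idx])
        action = rng.choice(len(envs[i]), p=probs)  # simulated demonstration
        posterior = update_posterior(posterior, envs[i], action, thetas)
    return posterior

# Two candidate rewards and two candidate environments (toy values).
thetas = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
envs = [
    np.array([[1.0, 1.0], [1.0, 1.0]]),  # identical actions: uninformative
    np.array([[1.0, 0.0], [0.0, 1.0]]),  # actions separate the two hypotheses
]
prior = np.array([0.5, 0.5])
chosen = select_environment(envs, thetas, prior)
final_posterior = run(envs, thetas, true_idx=0)
```

In this toy setup the first environment yields the same expert behaviour under both hypotheses, so demonstrations there cannot distinguish the rewards; the selection step prefers the second environment, where a single demonstration already shifts most of the posterior mass.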