Inverse reinforcement learning (IRL) methods assume that expert data is generated by an agent optimizing some reward function. In many settings, however, the agent optimizes a reward function subject to constraints, and these constraints induce behaviors that are difficult to express with a reward function alone. We consider the setting where the reward function is given and the constraints are unknown, and propose a method that recovers these constraints from the expert data. While previous work has focused on recovering hard constraints, our method can recover cumulative soft constraints, which the agent satisfies on average per episode. In IRL fashion, our method iteratively adjusts the constraint function through a constrained optimization procedure until the agent's behavior matches the expert's behavior. We demonstrate our approach on synthetic environments, robotics environments, and real-world highway driving scenarios.
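As a rough illustration of what a cumulative soft constraint means here, the agent's problem can be sketched as a constrained policy optimization. The notation is an assumption made for this sketch and is not taken from the text above: \pi is the policy, \tau a trajectory of length T, r the given reward, c the unknown constraint (cost) function, and \beta a cost budget.

```latex
% Sketch of the agent's problem under an assumed notation:
% maximize expected return while keeping the expected cumulative cost within a budget.
\max_{\pi} \; \mathbb{E}_{\tau \sim \pi}\!\left[ \sum_{t=0}^{T} r(s_t, a_t) \right]
\quad \text{subject to} \quad
\mathbb{E}_{\tau \sim \pi}\!\left[ \sum_{t=0}^{T} c(s_t, a_t) \right] \le \beta
```

Under this reading, the constraint bounds the cost accumulated over an episode in expectation, so individual episodes may exceed the budget. A hard constraint, by contrast, would have to hold at every step of every trajectory. The constraint recovery problem then amounts to finding a function c such that the policy solving this problem behaves like the expert.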