When artificial agents are deployed in real-world environments where they interact with humans, their behavior must align with the values, social norms or other requirements of those environments. However, many environments impose implicit constraints that are difficult to specify and to transfer to a learning agent. To address this challenge, we propose a novel method that uses the principle of maximum causal entropy to learn both the constraints and an optimal policy that adheres to them, from demonstrations of agents that abide by the constraints. We prove convergence in the tabular setting and provide an approximation that scales to complex environments. We evaluate the learned policy by the reward it receives and the number of constraint violations it incurs, and we evaluate the learned cost function by its transferability to other agents. Our method outperforms state-of-the-art approaches across a variety of tasks and environments, and it handles problems with stochastic dynamics and continuous state-action spaces.
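To make the tabular setting concrete, the following is a minimal sketch of how a maximum causal entropy policy could be computed once a cost function and a penalty weight are fixed. It is not the method proposed here: the abstract does not specify the algorithm, so the soft value iteration below, the fixed weight `lam`, the toy two-state MDP, and all function and variable names are illustrative assumptions. In particular, the sketch does not learn the cost from demonstrations.

```python
import math

def logsumexp(xs):
    """Numerically stable log(sum(exp(x))) over a list of floats."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

def soft_value_iteration(P, reward, cost, lam, gamma=0.9, iters=500):
    """Soft (maximum causal entropy) value iteration on a tabular MDP.

    P[s][a] is a list of (probability, next_state) pairs, so stochastic
    dynamics are allowed. The cost is folded into the reward with a fixed,
    hypothetical penalty weight `lam` (an assumption of this sketch).
    Returns policy[s][a], the probability of action a in state s.
    """
    n_states = len(P)
    V = [0.0] * n_states
    for _ in range(iters):
        # Soft Bellman backup on the penalized reward r - lam * c.
        Q = [[reward[s][a] - lam * cost[s][a]
              + gamma * sum(p * V[s2] for p, s2 in P[s][a])
              for a in range(len(P[s]))]
             for s in range(n_states)]
        V = [logsumexp(q) for q in Q]
    # The maximum causal entropy policy is a softmax over the soft Q-values.
    return [[math.exp(q - V[s]) for q in Q[s]] for s in range(n_states)]

# Toy MDP (hypothetical): state 0 is the start, state 1 is absorbing.
# In state 0, action 0 is a safe route and action 1 is a shortcut that
# earns more reward but incurs a cost and sometimes fails (stays in state 0).
P = [
    [[(1.0, 1)], [(0.8, 1), (0.2, 0)]],
    [[(1.0, 1)], [(1.0, 1)]],
]
reward = [[1.0, 2.0], [0.0, 0.0]]
cost = [[0.0, 1.0], [0.0, 0.0]]

pi_unconstrained = soft_value_iteration(P, reward, cost, lam=0.0)
pi_penalized = soft_value_iteration(P, reward, cost, lam=5.0)
print("P(shortcut), lam=0:", round(pi_unconstrained[0][1], 3))
print("P(shortcut), lam=5:", round(pi_penalized[0][1], 3))
```

With `lam=0` the policy favors the higher-reward shortcut, while a large `lam` makes the costly action rare. This illustrates, under the stated assumptions, how a cost function can shape a policy that otherwise optimizes reward.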