To understand how people interact with each other in collaborative settings, especially when individuals know little about their teammates, Multiagent Inverse Reinforcement Learning (MIRL) aims to infer the reward function guiding the behavior of each individual, given trajectories of a team's behavior during task performance. Unlike in current MIRL approaches, team members \emph{are not} assumed to know each other's goals a priori; rather, they collaborate by adapting to the goals of others, which they infer by observing their behavior, all while jointly performing a task. To address this problem, we propose a novel approach to MIRL via Theory of Mind (MIRL-ToM). For each agent, we first use ToM reasoning to estimate a posterior distribution over baseline reward profiles given their demonstrated behavior. We then perform MIRL via decentralized equilibrium by employing single-agent Maximum Entropy IRL to infer a reward function for each agent, where we simulate the behavior of other teammates according to the time-varying distribution over profiles. We evaluate our approach in a simulated two-player search-and-rescue operation where the goal of the agents, playing different roles, is to search for and evacuate victims in the environment. Results show that the choice of baseline profiles is paramount to the recovery of ground-truth rewards, and that MIRL-ToM is able to recover the rewards used by agents interacting with either known or unknown teammates.
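The ToM step described above can be illustrated with a minimal sketch of Bayesian inference over a fixed set of candidate reward profiles. Everything in it is an assumption made for illustration and is not taken from the paper: the profile names, the state-independent per-action reward tables, the softmax (Boltzmann) action model with rationality parameter `beta`, and the function names. It only shows how a time-varying posterior over profiles could be maintained from a teammate's observed actions.

```python
import math

def softmax_policy(action_values, beta=1.0):
    """Boltzmann action distribution: P(a) proportional to exp(beta * value(a))."""
    top = max(action_values.values())  # subtract the max for numerical stability
    exps = {a: math.exp(beta * (v - top)) for a, v in action_values.items()}
    total = sum(exps.values())
    return {a: e / total for a, e in exps.items()}

def update_posterior(prior, profiles, action, beta=1.0):
    """One Bayes step: P(profile | action) proportional to P(action | profile) * P(profile)."""
    unnormalized = {
        name: prior[name] * softmax_policy(profiles[name], beta)[action]
        for name in prior
    }
    total = sum(unnormalized.values())
    return {name: p / total for name, p in unnormalized.items()}

# Hypothetical baseline profiles: each maps an action to the reward that a
# teammate with this profile would assign to it (a deliberate simplification
# that ignores state).
PROFILES = {
    "searcher": {"search": 1.0, "evacuate": 0.0},
    "rescuer": {"search": 0.0, "evacuate": 1.0},
}

def posterior_trace(actions, profiles=PROFILES, beta=2.0):
    """Return the belief over profiles before any observation and after each action."""
    belief = {name: 1.0 / len(profiles) for name in profiles}  # uniform prior
    trace = [belief]
    for action in actions:
        belief = update_posterior(belief, profiles, action, beta)
        trace.append(belief)
    return trace

trace = posterior_trace(["evacuate", "evacuate", "search"])
for step, belief in enumerate(trace):
    print(step, {name: round(p, 3) for name, p in belief.items()})
```

In this toy run, belief in the hypothetical `rescuer` profile rises after each observed `evacuate` action and falls again after `search`. In a fuller setting, the per-step beliefs would be the time-varying distribution used to simulate teammates while a single-agent Maximum Entropy IRL procedure fits each agent's reward.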