Inverse Reinforcement Learning (IRL) techniques deal with the problem of deducing a reward function that explains the behavior of an expert agent who is assumed to act optimally in an underlying unknown task. In several problems of interest, however, it is possible to observe the behavior of multiple experts with different degrees of optimality (e.g., racing drivers whose skills range from amateur to professional). For this reason, in this work, we extend the IRL formulation to problems where, in addition to demonstrations from the optimal agent, we can observe the behavior of multiple sub-optimal experts. Given this problem, we first study the theoretical properties of the class of reward functions that are compatible with a given set of experts, i.e., the feasible reward set. Our results show that the presence of multiple sub-optimal experts can significantly shrink the set of compatible rewards. Furthermore, we study the statistical complexity of estimating the feasible reward set with a generative model. To this end, we analyze a uniform sampling algorithm that turns out to be minimax optimal whenever the sub-optimal experts' performance level is sufficiently close to that of the optimal agent.