In Imitation Learning (IL), utilizing suboptimal and heterogeneous demonstrations presents a substantial challenge due to the varied nature of real-world data. However, standard IL algorithms treat these datasets as homogeneous, thereby inheriting the deficiencies of suboptimal demonstrators. Previous approaches to this issue rely on impractical assumptions, such as access to high-quality data subsets, confidence rankings, or explicit environmental knowledge. This paper introduces IRLEED, Inverse Reinforcement Learning by Estimating Expertise of Demonstrators, a novel framework that overcomes these hurdles without prior knowledge of demonstrator expertise. IRLEED enhances existing Inverse Reinforcement Learning (IRL) algorithms by combining a general model of demonstrator suboptimality, which accounts for reward bias and action variance, with a Maximum Entropy IRL framework to efficiently derive the optimal policy from diverse, suboptimal demonstrations. Experiments in both online and offline IL settings, with simulated and human-generated data, demonstrate IRLEED's adaptability and effectiveness, making it a versatile solution for learning from suboptimal demonstrations.
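The suboptimality model mentioned above can be illustrated with a simplified sketch. Under Maximum Entropy IRL, a natural way to model a demonstrator with "reward bias and action variance" is as a Boltzmann-rational agent: a softmax policy over a personally biased reward, with a per-demonstrator inverse temperature controlling expertise. The names `demonstrator_policy`, `bias`, and `beta` below are illustrative assumptions, not the paper's exact parameterization.

```python
import numpy as np

def demonstrator_policy(reward, bias, beta):
    """Boltzmann-rational policy for one demonstrator over a discrete
    action set: a softmax of a personally biased reward, scaled by an
    expertise parameter beta (higher beta -> closer to optimal).

    reward : (n_actions,) shared task reward at the current state
    bias   : (n_actions,) demonstrator-specific reward bias
    beta   : scalar inverse temperature (expertise)
    """
    logits = beta * (reward + bias)
    logits = logits - logits.max()   # numerical stability
    p = np.exp(logits)
    return p / p.sum()

# A high-expertise, unbiased demonstrator concentrates on the best
# action; a low-expertise one acts almost uniformly (high variance).
r = np.array([1.0, 0.5, 0.0])
expert = demonstrator_policy(r, np.zeros(3), beta=10.0)
novice = demonstrator_policy(r, np.zeros(3), beta=0.1)
```

In this framing, estimating each demonstrator's `beta` and `bias` jointly with the shared reward is what lets the learner discount low-expertise demonstrations rather than imitating them verbatim.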