Practical imitation learning (IL) systems rely on large human demonstration datasets for successful policy learning. However, maintaining the quality of collected data is difficult: some demonstrations are suboptimal, which degrades the overall dataset and, in turn, the learned policy. Furthermore, the intrinsic heterogeneity of human behavior can produce equally successful but disparate demonstrations, which makes demonstration quality even harder to discern. To address these challenges, this paper introduces Learning to Discern (L2D), an offline imitation learning framework for learning from demonstrations of diverse quality and style. Given a small batch of demonstrations with sparse quality labels, we learn a latent representation for temporally embedded trajectory segments. Preference learning in this latent space trains a quality evaluator that generalizes to new demonstrators exhibiting different styles. Empirically, we show that L2D can effectively assess and learn from demonstrations of varying quality and style, leading to improved policy performance across a range of tasks, both in simulation and on a physical robot.
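To make the preference-learning step more concrete, the following is a minimal sketch of how a quality evaluator could be trained from pairwise preferences over latent segment embeddings. The abstract does not specify the loss or the evaluator architecture, so everything here is an assumption for illustration: the Bradley-Terry style loss, the linear scorer, the synthetic 3-dimensional embeddings, and the names `score`, `preference_loss`, and `fit` are hypothetical and are not taken from L2D itself.

```python
import math
import random

def score(w, z):
    """Hypothetical linear quality score of one latent segment embedding z."""
    return sum(wi * zi for wi, zi in zip(w, z))

def preference_loss(w, pairs):
    """Mean Bradley-Terry negative log-likelihood over (preferred, other) pairs."""
    total = 0.0
    for z_pref, z_other in pairs:
        diff = score(w, z_pref) - score(w, z_other)
        # -log(sigmoid(diff)) = log(1 + exp(-diff)), written in a stable form
        total += max(-diff, 0.0) + math.log1p(math.exp(-abs(diff)))
    return total / len(pairs)

def fit(pairs, dim, steps=200, lr=0.5):
    """Plain gradient descent on the preference loss (assumed optimizer)."""
    w = [0.0] * dim
    for _ in range(steps):
        grad = [0.0] * dim
        for z_pref, z_other in pairs:
            diff = score(w, z_pref) - score(w, z_other)
            # d/d(diff) of log(1 + exp(-diff)) is -1 / (1 + exp(diff));
            # the clamp only guards against overflow in exp
            weight = -1.0 / (1.0 + math.exp(min(diff, 50.0)))
            for k in range(dim):
                grad[k] += weight * (z_pref[k] - z_other[k])
        w = [wk - lr * gk / len(pairs) for wk, gk in zip(w, grad)]
    return w

# Synthetic stand-in for latent embeddings: quality is a hidden linear
# function, and each pair is ordered so the higher-quality segment comes first.
random.seed(0)
hidden_w = [1.0, -2.0, 0.5]
pairs = []
for _ in range(200):
    a = [random.gauss(0.0, 1.0) for _ in range(3)]
    b = [random.gauss(0.0, 1.0) for _ in range(3)]
    pairs.append((a, b) if score(hidden_w, a) > score(hidden_w, b) else (b, a))

w = fit(pairs, dim=3)
accuracy = sum(score(w, a) > score(w, b) for a, b in pairs) / len(pairs)
print(f"loss={preference_loss(w, pairs):.3f} pairwise accuracy={accuracy:.2f}")
```

In this sketch the evaluator only ever sees which of two segments is preferred, never an absolute quality value, which is one way sparse quality labels could be turned into a training signal. A learned encoder and a nonlinear scorer would replace the synthetic embeddings and the linear `score` in any realistic setting.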