Towards Diverse Behaviors: A Benchmark for Imitation Learning with Human Demonstrations

Imitation learning with human data has demonstrated remarkable success in teaching robots a wide range of skills. However, the inherent diversity of human behavior gives rise to multi-modal data distributions, which pose a formidable challenge for existing imitation learning algorithms. Quantifying a model's capacity to capture and replicate this diversity remains an open problem. In this work, we introduce simulation benchmark environments and the corresponding Datasets with Diverse human Demonstrations for Imitation Learning (D3IL), designed explicitly to evaluate a model's ability to learn multi-modal behavior. Our environments involve multiple sub-tasks that need to be solved, require manipulating multiple objects, which increases the diversity of the behavior, and can only be solved by policies that rely on closed-loop sensory feedback. Other available datasets are missing at least one of these challenging properties. To address the challenge of diversity quantification, we introduce tractable metrics that provide valuable insights into a model's ability to acquire and reproduce diverse behaviors. These metrics offer a practical means to assess the robustness and versatility of imitation learning algorithms. Furthermore, we conduct a thorough evaluation of state-of-the-art methods on the proposed task suite, which serves as a benchmark for assessing their capability to learn diverse behaviors. Our findings shed light on the effectiveness of these methods in tackling the intricate problem of capturing and generalizing multi-modal human behaviors, offering a valuable reference for the design of future imitation learning algorithms.
