A key challenge for robotic manipulation in open domains is how robots can acquire diverse and generalizable skills. Recent research in one-shot imitation learning has shown promise in transferring trained policies to new tasks based on demonstrations. This capability is attractive because it lets robots acquire new skills and can improve task and motion planning. However, due to limitations in the training datasets, the community has so far focused mainly on simple cases, such as push or pick-and-place tasks, that rely solely on visual guidance. In reality, many skills are more complex, and some require both visual and tactile perception to solve. This paper aims to unlock the potential for an agent to generalize to hundreds of real-world skills with multi-modal perception. To achieve this, we collect a dataset comprising over 110,000 \emph{contact-rich} robot manipulation sequences across diverse skills, contexts, robots, and camera viewpoints, all collected \emph{in the real world}. Each sequence includes visual, force, audio, and action information, along with a corresponding human demonstration video. We have invested significant effort in calibrating all the sensors and ensuring a high-quality dataset. The dataset is publicly available at rh20t.github.io.