We introduce a data capture system and a new dataset, HO-Cap, for 3D reconstruction and pose tracking of hands and objects in videos. The system leverages multiple RGBD cameras and a HoloLens headset for data collection, avoiding the use of expensive 3D scanners or mocap systems. We propose a semi-automatic method for annotating the shapes and poses of hands and objects in the collected videos, significantly reducing annotation time compared to manual labeling. With this system, we captured a video dataset of humans interacting with objects to perform various tasks, including simple pick-and-place actions, handovers between hands, and using objects according to their affordances. These videos can serve as human demonstrations for research in embodied AI and robot manipulation. Our data capture setup and annotation framework will be made available to the community for reconstructing the 3D shapes of objects and human hands and tracking their poses in videos.