We introduce HOT3D, a publicly available dataset for egocentric hand and object tracking in 3D. The dataset offers over 833 minutes (more than 3.7M images) of multi-view RGB/monochrome image streams showing 19 subjects interacting with 33 diverse rigid objects, multi-modal signals such as eye gaze and scene point clouds, and comprehensive ground-truth annotations including 3D poses of objects, hands, and cameras, as well as 3D models of hands and objects. In addition to simple pick-up/observe/put-down actions, HOT3D contains scenarios resembling typical actions in kitchen, office, and living-room environments. The dataset was recorded with two head-mounted devices from Meta: Project Aria, a research prototype of lightweight AR/AI glasses, and Quest 3, a production VR headset sold in millions of units. Ground-truth poses were obtained with a professional motion-capture system using small optical markers attached to hands and objects. Hand annotations are provided in the UmeTrack and MANO formats, and objects are represented by 3D meshes with PBR materials obtained by an in-house scanner. In our experiments, we demonstrate the effectiveness of multi-view egocentric data for three popular tasks: 3D hand tracking, 6DoF object pose estimation, and 3D lifting of unknown in-hand objects. The evaluated multi-view methods, whose benchmarking is uniquely enabled by HOT3D, significantly outperform their single-view counterparts.