Accurate 3D understanding of human hands and objects during manipulation remains a significant challenge for egocentric computer vision. Existing hand-object interaction datasets are predominantly captured in controlled studio settings, which limits both their environmental diversity and the ability of models trained on them to generalize to real-world scenarios. To address this challenge, we introduce a novel marker-less multi-camera system that allows nearly unconstrained mobility in genuinely in-the-wild conditions while still producing precise 3D annotations of hands and objects. The capture system consists of a lightweight, back-mounted multi-camera rig that is synchronized and calibrated with a user-worn VR headset. To obtain 3D ground-truth annotations of hands and objects, we develop an ego-exo tracking pipeline and rigorously evaluate its quality. Finally, we present SHOW3D, the first large-scale dataset with 3D annotations that shows hands interacting with objects in diverse real-world environments, including outdoor settings. Our approach significantly reduces the fundamental trade-off between environmental realism and the accuracy of 3D annotations, which we validate with experiments on several downstream tasks. show3d-dataset.github.io