We present AssemblyHands, a large-scale benchmark dataset with accurate 3D hand pose annotations, to facilitate the study of egocentric activities with challenging hand-object interactions. The dataset includes synchronized egocentric and exocentric images sampled from the recent Assembly101 dataset, in which participants assemble and disassemble take-apart toys. To obtain high-quality 3D hand pose annotations for the egocentric images, we develop an efficient pipeline in which an initial set of manual annotations is used to train a model that automatically annotates a much larger dataset. Our annotation model uses multi-view feature fusion and an iterative refinement scheme, and achieves an average keypoint error of 4.20 mm, which is 85% lower than the error of the original annotations in Assembly101. AssemblyHands provides 3.0M annotated images, including 490K egocentric images, making it the largest existing benchmark dataset for egocentric 3D hand pose estimation. Using this data, we develop a strong single-view baseline for 3D hand pose estimation from egocentric images. Furthermore, we design a novel action classification task to evaluate predicted 3D hand poses. Our study shows that higher-quality hand poses directly improve the ability to recognize actions.
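The abstract reports an average keypoint error in millimetres and a relative reduction against the earlier annotations, but does not define either quantity. A common reading is the mean Euclidean distance between predicted and ground-truth 3D keypoints, and the fractional drop between two such averages. The sketch below illustrates that reading only: the function names, the `(frames, keypoints, 3)` array layout, the 21-keypoint toy input, and the optional validity mask are all assumptions made for illustration, not part of the AssemblyHands code or evaluation protocol.

```python
import numpy as np


def mean_keypoint_error_mm(pred, gt, valid=None):
    """Mean Euclidean distance between predicted and reference 3D keypoints.

    pred, gt: arrays of shape (..., 3), assumed to be in millimetres.
    valid:    optional boolean mask of shape (...) selecting annotated keypoints
              (hypothetical; the abstract does not mention masking).
    """
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    if pred.shape != gt.shape or pred.shape[-1] != 3:
        raise ValueError("pred and gt must share a shape ending in 3")

    # Per-keypoint Euclidean distance along the xyz axis.
    dists = np.linalg.norm(pred - gt, axis=-1)

    if valid is None:
        return float(dists.mean())

    valid = np.asarray(valid, dtype=bool)
    if valid.shape != dists.shape:
        raise ValueError("valid must have one entry per keypoint")
    if not valid.any():
        raise ValueError("no valid keypoints to evaluate")
    return float(dists[valid].mean())


def relative_reduction(new_error, old_error):
    """Fraction by which new_error is lower than old_error (0.85 means 85% lower)."""
    if old_error <= 0:
        raise ValueError("old_error must be positive")
    return 1.0 - new_error / old_error


# Toy data: 2 frames, 21 keypoints each, every prediction offset by (3, 4, 0) mm.
gt = np.zeros((2, 21, 3))
pred = gt + np.array([3.0, 4.0, 0.0])
error = mean_keypoint_error_mm(pred, gt)

# Toy comparison of two made-up averages, not figures from the paper.
reduction = relative_reduction(2.0, 8.0)
```

With the toy input every keypoint is displaced by a 3-4-5 offset, so `error` evaluates to 5.0, and the made-up pair of averages gives a reduction of 0.75. Keeping the mask optional lets the same function handle keypoints that lack a reference annotation, if a dataset marks them.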