In this paper, we realize automatic visual recognition and direction estimation of pointing. We introduce the first neural pointing understanding method, built on two key contributions. The first is the DP Dataset, a first-of-its-kind large-scale dataset for pointing recognition and direction estimation. The DP Dataset consists of more than 2 million frames of over 33 people pointing in various styles, with each frame annotated with pointing timing and 3D direction. The second is DeePoint, a novel deep network model for joint recognition and 3D direction estimation of pointing. DeePoint is a Transformer-based network that fully leverages the spatio-temporal coordination of body parts, not just the hands. Through extensive experiments, we demonstrate the accuracy and efficiency of DeePoint. We believe the DP Dataset and DeePoint will serve as a sound foundation for visual human intention understanding.