3D visual grounding aims to link natural language descriptions to specific regions of a 3D scene represented as a point cloud, and it is a fundamental task for human-robot interaction. Recognition errors can significantly reduce overall accuracy and, in turn, degrade the operation of AI systems. Although existing methods are effective, they suffer from low recognition accuracy when multiple adjacent objects have similar appearances. To address this issue, this work introduces human-robot interaction as a cue for 3D visual grounding. Specifically, a new task termed Embodied Reference Understanding (ERU) is first designed for this purpose. A new dataset, ScanERU, is then constructed to evaluate the effectiveness of this idea. Unlike existing datasets, ScanERU is the first to cover semi-synthetic scenes that integrate textual, real-world visual, and synthetic gestural information. Additionally, this paper formulates a heuristic framework based on attention mechanisms and human body movements to inspire further research on ERU. Experimental results demonstrate the superiority of the proposed method, especially in recognizing multiple identical objects. Our code and dataset will be made publicly available.