3D visual grounding aims to localize a target object in a 3D point cloud given a free-form language description. The sentences describing the target object typically provide information about its spatial relations to other objects and its position within the whole scene. In this work, we propose a relation-aware one-stage framework, named 3D Relative Position-aware Network (3DRP-Net), which effectively captures the relative spatial relationships between objects and enhances object attributes. Specifically, 1) we propose a 3D Relative Position Multi-head Attention (3DRP-MA) module that analyzes relative relations from different directions in the context of object pairs, which helps the model focus on the specific object relations mentioned in the sentence. 2) We design a soft-labeling strategy to alleviate the spatial ambiguity caused by redundant points, which further stabilizes and enhances the learning process through a constant and discriminative distribution. Extensive experiments conducted on three benchmarks (ScanRefer, Nr3D, and Sr3D) demonstrate that our method generally outperforms state-of-the-art methods. The source code will be released on GitHub.
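The abstract does not give the formulation of 3DRP-MA, so the following is only a minimal illustrative sketch of the general idea of relative-position-aware multi-head attention, not the authors' implementation. It assumes that each object is summarized by a 3D center, that signed pairwise offsets along the x, y, and z axes are clipped and quantized into bins, and that a per-axis, per-head bias table (random here, learnable in practice) is added to the attention logits. All names (`relative_position_bias`, `relative_position_attention`, `tables`, `max_dist`) are hypothetical.

```python
import numpy as np


def relative_position_bias(centers, tables, max_dist=2.0):
    """Per-head attention bias from pairwise 3D offsets (illustrative assumption).

    centers: (N, 3) object centers.
    tables:  (3, H, B) bias tables, one per axis and head, with B offset bins.
    Returns an (H, N, N) bias tensor.
    """
    num_axes, num_heads, num_bins = tables.shape
    # offsets[i, j] = centers[j] - centers[i], so the sign encodes direction.
    offsets = centers[None, :, :] - centers[:, None, :]
    # Clip offsets to [-max_dist, max_dist] and map them to bins 0..B-1.
    scaled = (np.clip(offsets, -max_dist, max_dist) + max_dist) / (2 * max_dist)
    bins = np.minimum((scaled * num_bins).astype(int), num_bins - 1)
    n = centers.shape[0]
    bias = np.zeros((num_heads, n, n))
    for axis in range(num_axes):
        # tables[axis] is (H, B); indexing with an (N, N) array gives (H, N, N).
        bias += tables[axis][:, bins[..., axis]]
    return bias


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def relative_position_attention(features, centers, wq, wk, wv, tables):
    """Multi-head self-attention over objects with a relative-position bias.

    features: (N, D) object features; wq, wk, wv: (D, D) projections.
    Returns the (N, D) attended features and the (H, N, N) attention weights.
    """
    num_heads = tables.shape[1]
    n, d = features.shape
    head_dim = d // num_heads

    def split_heads(x):
        return x.reshape(n, num_heads, head_dim).transpose(1, 0, 2)

    q = split_heads(features @ wq)
    k = split_heads(features @ wk)
    v = split_heads(features @ wv)
    logits = q @ k.transpose(0, 2, 1) / np.sqrt(head_dim)
    # Pairwise spatial relations shift the logits independently for each head.
    logits = logits + relative_position_bias(centers, tables)
    weights = softmax(logits, axis=-1)
    out = (weights @ v).transpose(1, 0, 2).reshape(n, d)
    return out, weights


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, d, heads, bins = 5, 8, 2, 4
    feats = rng.normal(size=(n, d))
    centers = rng.uniform(-1.0, 1.0, size=(n, 3))
    wq, wk, wv = (rng.normal(size=(d, d)) for _ in range(3))
    tables = rng.normal(size=(3, heads, bins))
    out, weights = relative_position_attention(feats, centers, wq, wk, wv, tables)
    print(out.shape, weights.shape)
```

Because the bias depends only on offsets between object centers, translating the whole scene leaves the attention weights unchanged in this sketch, which is the property that makes a relative (rather than absolute) position encoding suitable for descriptions such as "the chair to the left of the table".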