3D visual grounding aims to identify objects in 3D point cloud scenes that match specific natural language descriptions. This requires the model not only to focus on the target object itself but also to consider the surrounding environment to determine whether the description is satisfied. Most previous works attempt to accomplish both tasks within the same module, which can easily dilute the model's attention. To address this, we propose PD-APE, a dual-branch decoding framework that separately decodes target object attributes and surrounding layouts. Specifically, in the target object branch, the decoder processes text tokens that describe features of the target object (e.g., category and color), guiding the queries to attend to the target object itself. In the surrounding branch, the queries align with the other text tokens, which carry information about the surrounding environment, so that the attention maps accurately capture the layout described in the text. Thanks to this dual-branch design, the queries can focus on the points relevant to each branch's specific objective. Moreover, we design an adaptive position encoding method for each branch. In the target object branch, the position encoding relies on the relative positions between seed points and predicted 3D boxes. In the surrounding branch, the attention map is additionally guided by the confidence between visual and text features, enabling the queries to focus on points that carry valuable layout information. Extensive experiments demonstrate that PD-APE surpasses the state of the art on two widely adopted 3D visual grounding datasets, ScanRefer and Nr3D.
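The two adaptive position encodings described above can be illustrated with a minimal sketch. The abstract does not give the exact formulas, so everything here is an assumption made for illustration: the function names, the single-head attention, the box-relative bias (zero inside the predicted box, increasingly negative outside it), and the use of a per-point log-confidence as the surrounding-branch bias are all hypothetical and are not the authors' implementation.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def box_relative_bias(seed_xyz, box_center, box_size):
    """Assumed target-branch bias: 0 for seed points inside the predicted
    box, more negative the farther a point lies outside it."""
    # Per-axis offset from the box centre, normalised by the half extent.
    rel = [abs(p - c) / (0.5 * s)
           for p, c, s in zip(seed_xyz, box_center, box_size)]
    return -max(0.0, max(rel) - 1.0)


def attend(query, keys, bias):
    """Single-head attention weights with an additive per-key bias."""
    scale = math.sqrt(len(query))
    return softmax([dot(query, k) / scale + b for k, b in zip(keys, bias)])


def dual_branch_attention(target_query, surround_query, seed_feats,
                          seed_xyz, box_center, box_size, confidence):
    """Return (target_weights, surround_weights) over the seed points.

    `confidence` is an assumed per-point visual-text matching score in (0, 1].
    """
    # Target branch: bias attention toward points near the predicted box.
    target_bias = [box_relative_bias(p, box_center, box_size)
                   for p in seed_xyz]
    # Surrounding branch: bias attention toward high-confidence points.
    surround_bias = [math.log(c + 1e-6) for c in confidence]
    return (attend(target_query, seed_feats, target_bias),
            attend(surround_query, seed_feats, surround_bias))
```

In this sketch the two branches share the seed-point features but use separate queries and separate biases, so the target branch concentrates on points inside the predicted box while the surrounding branch follows the confidence scores wherever they are high.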