3D visual grounding aims to locate the object in a point cloud that is referred to by a free-form natural language description with rich semantic cues. However, existing methods either extract sentence-level features that couple all words or focus mainly on object names, thereby losing word-level information or neglecting other attributes. To alleviate these issues, we present EDA, which Explicitly Decouples the textual attributes in a sentence and conducts Dense Alignment between this fine-grained language and point cloud objects. Specifically, we first propose a text decoupling module that produces textual features for every semantic component. We then design two losses to supervise the dense matching between the two modalities: a position alignment loss and a semantic alignment loss. In addition, we introduce a new visual grounding task, locating objects without object names, which can thoroughly evaluate the model's dense alignment capacity. In experiments, we achieve state-of-the-art performance on two widely adopted 3D visual grounding datasets, ScanRefer and SR3D/NR3D, and lead by a clear margin on our newly proposed task. The source code is available at https://github.com/yanmin-wu/EDA.
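To make the two alignment losses named in the abstract more concrete, the following is a minimal sketch and not the authors' implementation. It assumes that the sentence has already been decoupled into semantic components given as token positions, that the position alignment loss is a cross-entropy between an object's predicted distribution over token positions and the normalized component mask, and that the semantic alignment loss is a softmax contrastive loss over candidate objects. The component names, function names, and the `temperature` parameter are all hypothetical.

```python
import math


def position_labels(tokens, component_spans):
    """Build one binary mask over token positions per semantic component.

    component_spans maps a (hypothetical) component name to the token
    indices that belong to it; here the spans are supplied by hand.
    """
    masks = {}
    for name, positions in component_spans.items():
        mask = [0.0] * len(tokens)
        for p in positions:
            mask[p] = 1.0
        masks[name] = mask
    return masks


def _log_sum_exp(values):
    # Numerically stable log(sum(exp(v))).
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def position_alignment_loss(position_logits, target_mask):
    """Cross-entropy between the softmax of position_logits and the
    normalized target mask (assumed form of the position alignment loss)."""
    log_z = _log_sum_exp(position_logits)
    total = sum(target_mask)
    return -sum(
        (t / total) * (v - log_z)
        for v, t in zip(position_logits, target_mask)
        if t > 0
    )


def semantic_alignment_loss(object_feats, component_feats, target_index,
                            temperature=0.1):
    """Softmax contrastive loss over candidate objects (assumed form).

    Each object is scored by its summed dot-product similarity to every
    decoupled text component; the target object should score highest.
    """
    scores = [
        sum(sum(o * c for o, c in zip(obj, comp))
            for comp in component_feats.values()) / temperature
        for obj in object_feats
    ]
    return _log_sum_exp(scores) - scores[target_index]


# Toy sentence with hand-written component spans (assumption, no parser).
tokens = "the brown chair next to the table".split()
spans = {
    "main_object": [2],
    "attribute": [1],
    "relation": [3, 4],
    "auxiliary_object": [6],
}
masks = position_labels(tokens, spans)

# Uniform logits over all token positions for the "relation" component.
pos_loss = position_alignment_loss([0.0] * len(tokens), masks["relation"])

# Two candidate objects with 2-D features; object 0 matches the text.
objects = [[1.0, 0.0], [0.0, 1.0]]
text = {"main_object": [1.0, 0.0], "attribute": [0.5, 0.0]}
loss_correct = semantic_alignment_loss(objects, text, target_index=0)
loss_wrong = semantic_alignment_loss(objects, text, target_index=1)
```

In this toy setting the semantic loss is lower when the target is the object whose features agree with the text components. Because every component contributes to the score, an object can in principle still be matched when the main object name is absent, which is the situation the newly proposed task evaluates.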