Localizing objects in 3D scenes according to the semantics of a natural language description is a fundamental task in multimedia understanding, with applications in areas such as robotics and autonomous driving. However, most existing 3D object grounding methods are restricted to a single-sentence input describing an individual object, and cannot comprehend or reason about the more contextualized, multi-object descriptions that arise in practical 3D settings. To this end, we introduce a new and challenging task, 3D Dense Object Grounding (3D DOG), which jointly localizes multiple objects described in a paragraph rather than a single sentence. Instead of localizing each sentence-guided object independently, we build on the observation that dense objects described in the same paragraph are often semantically related and spatially concentrated in one region of the 3D scene. To exploit these semantic and spatial relationships among densely referred objects for more accurate localization, we propose a novel stacked-Transformer framework for 3D DOG, named 3DOGSFormer. Specifically, we first devise a contextual query-driven local transformer decoder that generates an initial grounding proposal for each target object. We then employ a proposal-guided global transformer decoder that exploits the local object features to learn correlations among the objects and refine the initial grounding proposals. Extensive experiments on three challenging benchmarks (Nr3D, Sr3D, and ScanRefer) show that 3DOGSFormer outperforms state-of-the-art 3D single-object grounding methods and their dense-object variants by significant margins.
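The two-stage idea of a local decoder followed by a global decoder can be illustrated with a toy sketch. This is not the 3DOGSFormer implementation: it uses no learned weights, the feature vectors are made-up placeholders, and the names `attend`, `local_stage`, `global_stage`, and `ground_paragraph` are hypothetical. In particular, the global stage here is plain self-attention with a residual connection, used only as a stand-in for the proposal-guided global transformer decoder.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, keys, values):
    """Scaled dot-product attention over plain lists of vectors."""
    scale = math.sqrt(len(keys[0]))
    dim = len(values[0])
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        outputs.append(
            [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
        )
    return outputs


def local_stage(sentence_queries, scene_features):
    """Stage 1 (assumed form): each sentence query attends to the scene
    independently, giving one initial proposal feature per sentence."""
    return attend(sentence_queries, scene_features, scene_features)


def global_stage(proposals):
    """Stage 2 (assumed form): proposals attend to each other so that
    objects from the same paragraph can refine one another."""
    context = attend(proposals, proposals, proposals)
    # Residual connection: keep the initial proposal, add paragraph context.
    return [[p + c for p, c in zip(prop, ctx)]
            for prop, ctx in zip(proposals, context)]


def ground_paragraph(sentence_queries, scene_features):
    """Run the local stage, then refine its output with the global stage."""
    initial = local_stage(sentence_queries, scene_features)
    return global_stage(initial)


if __name__ == "__main__":
    # Three placeholder scene-object features and two sentence queries.
    scene = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    queries = [[1.0, 0.0], [0.0, 1.0]]
    for refined in ground_paragraph(queries, scene):
        print(refined)
```

The point of the sketch is the data flow: the first stage treats every sentence on its own, while the second stage lets the resulting proposals exchange information, which is where relationships among objects in the same paragraph could be used.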