3D visual grounding aims to identify the target object referred to by a natural language description within a 3D point cloud scene. Previous works usually require substantial data on point colors and their accompanying descriptions to exploit the complicated verbo-visual relations involved. In this work, we introduce Vigor, a novel Data-Efficient 3D Visual Grounding framework via Order-aware Referring. Vigor leverages a large language model (LLM) to produce a desirable referential order from the input description. With the proposed stacked object-referring blocks, the anchor objects predicted in this order allow the target object to be located progressively, without supervision on the identities of the anchor objects or on the exact relations between anchor and target objects. In addition, we present an order-aware warm-up training strategy, which augments referential orders for pre-training the visual grounding framework. This allows the framework to better capture the complex verbo-visual relations and benefits data-efficient learning. Experimental results on the NR3D and ScanRefer datasets demonstrate the superiority of Vigor in low-resource scenarios. In particular, Vigor surpasses current state-of-the-art frameworks by 9.3% and 7.6% in grounding accuracy under the 1% and 10% data settings on the NR3D dataset, respectively.
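To make the idea of progressive, order-aware referring concrete, the following is a toy sketch and not the Vigor implementation. It assumes the referential order is already given as a list of object labels (anchors first, target last), which in Vigor would come from the LLM, and it replaces the learned stacked object-referring blocks with a simple nearest-neighbor heuristic. The names `SceneObject` and `ground_by_order`, the use of ground-truth labels, and the "closest to the previous anchor" rule are all illustrative assumptions.

```python
from math import dist
from typing import List, NamedTuple, Tuple


class SceneObject(NamedTuple):
    """Hypothetical stand-in for an object proposal in a 3D scene."""
    obj_id: int
    label: str
    center: Tuple[float, float, float]


def ground_by_order(objects: List[SceneObject], order: List[str]) -> int:
    """Walk the referential order, picking one object per step.

    Assumption: each step selects the candidate with the matching label
    that is closest to the object chosen in the previous step. The real
    framework learns this selection; it is hard-coded here for clarity.
    """
    previous = None
    for label in order:
        candidates = [o for o in objects if o.label == label]
        if not candidates:
            raise ValueError(f"no object with label {label!r} in the scene")
        if previous is None:
            # First anchor: no spatial context yet, take the first match.
            previous = candidates[0]
        else:
            prev_center = previous.center
            previous = min(candidates, key=lambda o: dist(o.center, prev_center))
    # The last entry in the order is the target object.
    return previous.obj_id


# "The chair next to the table that is near the door."
scene = [
    SceneObject(0, "door", (0.0, 0.0, 0.0)),
    SceneObject(1, "table", (1.0, 0.0, 0.0)),
    SceneObject(2, "table", (5.0, 0.0, 0.0)),
    SceneObject(3, "chair", (1.5, 0.0, 0.0)),
    SceneObject(4, "chair", (5.5, 0.0, 0.0)),
]
print(ground_by_order(scene, ["door", "table", "chair"]))  # -> 3
```

The point of the sketch is only that resolving anchors in order narrows the search for the target step by step: the door disambiguates the table, and the table disambiguates the chair.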