Understanding 3D scenes from multi-view inputs has been shown to alleviate the view discrepancy issue in 3D visual grounding. However, existing methods typically neglect the view cues embedded in the text modality and fail to weigh the relative importance of different views. In this paper, we propose ViewRefer, a multi-view framework for 3D visual grounding that explores how to grasp view knowledge from both the text and 3D modalities. For the text branch, ViewRefer leverages the diverse linguistic knowledge of large-scale language models, e.g., GPT, to expand a single grounding text into multiple geometry-consistent descriptions. Meanwhile, in the 3D modality, a transformer fusion module with inter-view attention is introduced to boost the interaction of objects across views. On top of that, we further present a set of learnable multi-view prototypes, which memorize scene-agnostic knowledge for different views and enhance the framework from two perspectives: a view-guided attention module for more robust text features, and a view-guided scoring strategy during the final prediction. With our designed paradigm, ViewRefer achieves superior performance on three benchmarks, surpassing the second-best method by +2.8%, +1.2%, and +0.73% on Sr3D, Nr3D, and ScanRefer, respectively. Code will be released at https://github.com/ZiyuGuo99/ViewRefer3D.
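To make the idea of weighing views during the final prediction concrete, the following is a minimal sketch and not the ViewRefer implementation. The abstract does not specify how the scoring is computed, so everything here is an assumption for illustration: the function name `view_guided_scores`, the use of one pooled feature per view, the dot-product similarity against a per-view prototype, and the softmax normalization are all hypothetical choices.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def view_guided_scores(object_logits, view_features, prototypes):
    """Fuse per-view object logits with view-importance weights.

    All argument layouts are assumptions made for this sketch:
      object_logits: [num_views][num_objects] grounding scores per view.
      view_features: [num_views][dim] one pooled feature per view.
      prototypes:    [num_views][dim] learnable per-view vectors.
    Returns (fused_scores, importance).
    """
    # Assumed importance: similarity of each view's feature to its prototype,
    # normalized across views so the weights sum to 1.
    similarities = [dot(f, p) for f, p in zip(view_features, prototypes)]
    importance = softmax(similarities)

    # Weighted sum over views, replacing a plain average across views.
    num_objects = len(object_logits[0])
    fused = [
        sum(w * logits[j] for w, logits in zip(importance, object_logits))
        for j in range(num_objects)
    ]
    return fused, importance
```

With all-zero prototypes, every view receives the same weight and the fusion reduces to a plain average over views. A prototype that aligns strongly with one view's feature shifts the final prediction toward that view's scores.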