Understanding 3D scenes from multi-view inputs has been shown to alleviate the view discrepancy issue in 3D visual grounding. However, existing methods typically neglect the view cues embedded in the text modality and fail to weigh the relative importance of different views. In this paper, we propose ViewRefer, a multi-view framework for 3D visual grounding that explores how to grasp view knowledge from both the text and 3D modalities. For the text modality, ViewRefer leverages the diverse linguistic knowledge of large-scale language models, e.g., GPT, to expand a single grounding text into multiple geometry-consistent descriptions. Meanwhile, for the 3D modality, a transformer fusion module with inter-view attention is introduced to boost the interaction of objects across views. On top of that, we further present a set of learnable multi-view prototypes, which memorize scene-agnostic knowledge for different views and enhance the framework from two perspectives: a view-guided attention module for more robust text features, and a view-guided scoring strategy during the final prediction. With our designed paradigm, ViewRefer achieves superior performance on three benchmarks and surpasses the second-best method by +2.8%, +1.5%, and +1.35% on Sr3D, Nr3D, and ScanRefer, respectively.
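To make the idea of view-guided scoring concrete, the following is a minimal sketch of one way per-view predictions could be fused with view-dependent weights. The abstract does not specify how the weights are obtained, so everything here is an assumption for illustration: the function name `view_guided_scores`, the use of a dot product between each view prototype and a text feature, and the softmax normalization are all hypothetical and are not taken from the ViewRefer implementation.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def view_guided_scores(per_view_logits, prototypes, text_feature):
    """Fuse per-view object logits using one weight per view.

    per_view_logits: V lists, each holding N object logits for one view.
    prototypes:      V vectors, one (hypothetical) prototype per view.
    text_feature:    a single vector describing the grounding text.

    Assumption: the view weight is the softmax of prototype-text similarity.
    """
    if len(per_view_logits) != len(prototypes):
        raise ValueError("need exactly one prototype per view")
    weights = softmax([dot(p, text_feature) for p in prototypes])
    n_objects = len(per_view_logits[0])
    fused = [
        sum(w * logits[i] for w, logits in zip(weights, per_view_logits))
        for i in range(n_objects)
    ]
    return fused, weights


if __name__ == "__main__":
    logits = [[1.0, 0.0], [0.0, 3.0]]          # 2 views, 2 candidate objects
    protos = [[10.0, 0.0], [0.0, 0.0]]         # toy prototypes
    fused, weights = view_guided_scores(logits, protos, [1.0, 0.0])
    predicted = max(range(len(fused)), key=fused.__getitem__)
    print(weights, fused, predicted)
```

In this toy setting, a view whose prototype aligns with the text dominates the fused score, whereas identical prototypes reduce the fusion to a plain average over views. In a trained model the prototypes would be learned parameters rather than hand-set vectors.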