We introduce a novel task of 3D visual grounding in monocular RGB images using language descriptions with both appearance and geometry information. Specifically, we build a large-scale dataset, Mono3DRefer, which contains 3D object targets with their corresponding geometric text descriptions, generated by ChatGPT and refined manually. To foster this task, we propose Mono3DVG-TR, an end-to-end transformer-based network, which takes advantage of both the appearance and geometry information in text embeddings for multi-modal learning and 3D object localization. A depth predictor is designed to explicitly learn geometry features. A dual text-guided adapter is proposed to refine multiscale visual and geometry features of the referred object. Based on depth-text-visual stacking attention, the decoder fuses object-level geometric cues and visual appearance into a learnable query. Comprehensive benchmarks and insightful analyses are provided for Mono3DVG. Extensive comparisons and ablation studies show that our method significantly outperforms all baselines. The dataset and code will be publicly available at: https://github.com/ZhanYang-nwpu/Mono3DVG.