Enabling Large Language Models (LLMs) to interact with 3D environments is challenging. Existing approaches extract point clouds either from ground truth (GT) geometry or from 3D scenes reconstructed by auxiliary models; text-image aligned 2D features from CLIP are then lifted to the point clouds, which serve as inputs for LLMs. However, this pipeline fails to establish 3D point-to-point connections, leading to a lack of spatial structure information. Moreover, the geometric and semantic representations of the scene remain separate rather than integrated and unified, which limits the level of 3D scene understanding. In this paper, we demonstrate the importance of a unified scene representation and reconstruction framework, which is essential for LLMs in 3D scenes. Specifically, we introduce Uni3DR^2, which extracts 3D geometric and semantic-aware representation features via frozen pre-trained 2D foundation models (e.g., CLIP and SAM) and a multi-scale aggregation 3D decoder. Our learned 3D representations not only contribute to the reconstruction process but also provide valuable knowledge for LLMs. Experimental results validate that our Uni3DR^2 yields convincing gains over the baseline on the 3D reconstruction dataset ScanNet (increasing F-Score by +1.8\%). When applied to LLMs, our Uni3DR^2-LLM exhibits superior performance over the baseline on the 3D vision-language understanding dataset ScanQA (increasing BLEU-1 by +4.0\% and +4.2\% on the val and test sets, respectively). Furthermore, it outperforms the state-of-the-art method that uses additional GT point clouds on both ScanQA and 3DMV-VQA.