FM-OV3D: Foundation Model-based Cross-modal Knowledge Blending for Open-Vocabulary 3D Detection

The superior performance of pre-trained foundation models on various visual tasks underscores their potential to enhance the open-vocabulary ability of 2D models. Existing methods explore analogous applications in 3D space. However, most of them extract knowledge from only a single foundation model, which limits the open-vocabulary ability of 3D models. We hypothesize that leveraging complementary pre-trained knowledge from multiple foundation models can improve knowledge transfer from 2D pre-trained vision-language models to 3D space. In this work, we propose FM-OV3D, a method of Foundation Model-based Cross-modal Knowledge Blending for Open-Vocabulary 3D Detection, which improves the open-vocabulary localization and recognition abilities of a 3D model by blending knowledge from multiple pre-trained foundation models, achieving true open-vocabulary detection without being constrained by the original 3D datasets. Specifically, to learn open-vocabulary 3D localization, we adopt the open-vocabulary localization knowledge of the Grounded-Segment-Anything model. For open-vocabulary 3D recognition, we leverage the knowledge of generative foundation models, including GPT-3 and Stable Diffusion, and of cross-modal discriminative models such as CLIP. Experimental results on two popular benchmarks for open-vocabulary 3D object detection show that our model efficiently learns knowledge from multiple foundation models to enhance the open-vocabulary ability of the 3D model and achieves state-of-the-art performance on open-vocabulary 3D object detection tasks. Code is released at https://github.com/dmzhang0425/FM-OV3D.git.
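
As a rough illustration of the recognition side, open-vocabulary classification with a cross-modal discriminative model such as CLIP is commonly done by comparing a region feature against text embeddings of arbitrary category names and picking the closest match. The sketch below shows only that general matching step and is not the FM-OV3D implementation: the function names (`l2_normalize`, `cosine_scores`, `classify`) are hypothetical, and the three-dimensional toy vectors are assumptions standing in for real 3D region features and CLIP text embeddings.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine_scores(region_feature, text_embeddings):
    """Return the cosine similarity between one region feature and each class embedding."""
    r = l2_normalize(region_feature)
    scores = {}
    for name, emb in text_embeddings.items():
        t = l2_normalize(emb)
        scores[name] = sum(a * b for a, b in zip(r, t))
    return scores


def classify(region_feature, text_embeddings):
    """Pick the class name whose text embedding is most similar to the region feature."""
    scores = cosine_scores(region_feature, text_embeddings)
    return max(scores, key=scores.get)


# Toy stand-ins (assumptions): a real system would use text-encoder outputs for
# prompts built from the category names and a learned feature for a 3D proposal.
text_embeddings = {
    "chair": [1.0, 0.0, 0.0],
    "table": [0.0, 1.0, 0.0],
    "sofa": [0.0, 0.0, 1.0],
}
region_feature = [0.9, 0.1, 0.2]

predicted = classify(region_feature, text_embeddings)
print(predicted)
```

Because the category set is just a dictionary of text embeddings, new classes can be added at inference time without retraining, which is the property that makes this kind of matching open-vocabulary.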
