Current 3D open-vocabulary scene understanding methods mostly rely on well-aligned 2D images as a bridge for learning 3D features with language. These approaches are hard to apply in scenarios where 2D images are absent. In this work, we introduce an entirely new pipeline, OpenIns3D, which requires no 2D image inputs, for 3D open-vocabulary scene understanding at the instance level. The OpenIns3D framework employs a "Mask-Snap-Lookup" scheme. The "Mask" module learns class-agnostic mask proposals in 3D point clouds. The "Snap" module generates synthetic scene-level images at multiple scales and leverages 2D vision-language models to extract objects of interest. The "Lookup" module searches through the outcomes of "Snap" with the help of Mask2Pixel maps, which contain the precise correspondence between 3D masks and synthetic images, to assign category names to the proposed masks. This 2D-input-free, easy-to-train, and flexible approach achieves state-of-the-art results on a wide range of indoor and outdoor datasets by a large margin. Furthermore, OpenIns3D allows 2D detectors to be switched without re-training. When integrated with state-of-the-art 2D open-world models such as ODISE and GroundingDINO, it achieves superb results on open-vocabulary instance segmentation. When integrated with LLM-powered 2D models such as LISA, it demonstrates a remarkable capacity to process highly complex text queries, including those that require intricate reasoning and world knowledge. The code and model will be made publicly available.
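To make the "Lookup" step more concrete, the following is a minimal sketch for a single synthetic image, not the authors' implementation. It assumes a Mask2Pixel map stored as an integer array whose pixel values are 3D mask IDs (with -1 for background) and 2D detections given as labelled, scored boxes. The function names `box_iou` and `lookup_labels`, the `min_iou` threshold, and the IoU-times-score ranking rule are all hypothetical choices made for illustration.

```python
import numpy as np


def box_iou(a, b):
    """IoU of two boxes given as (x0, y0, x1, y1) with exclusive x1/y1."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix1 - ix0) * max(0, iy1 - iy0)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def lookup_labels(mask2pixel, detections, num_masks, min_iou=0.3):
    """Assign a category name to each 3D mask visible in one synthetic image.

    mask2pixel: (H, W) int array; each pixel holds a 3D mask ID, -1 = background
                (assumed layout, not specified in the abstract).
    detections: list of (label, score, (x0, y0, x1, y1)) from a 2D detector.
    """
    labels = {}
    for mask_id in range(num_masks):
        ys, xs = np.nonzero(mask2pixel == mask_id)
        if ys.size == 0:
            continue  # this mask is not visible in the image
        # Tight bounding box around the pixels the 3D mask projects to.
        mask_box = (xs.min(), ys.min(), xs.max() + 1, ys.max() + 1)
        best = None
        for label, score, box in detections:
            iou = box_iou(mask_box, box)
            # Hypothetical ranking rule: overlap weighted by detector confidence.
            if iou >= min_iou and (best is None or iou * score > best[0]):
                best = (iou * score, label)
        if best is not None:
            labels[mask_id] = best[1]
    return labels


# Toy example: two 3x3 masks in a 6x6 image and three detections.
mask2pixel = np.full((6, 6), -1, dtype=int)
mask2pixel[0:3, 0:3] = 0
mask2pixel[3:6, 3:6] = 1
detections = [
    ("chair", 0.9, (0, 0, 3, 3)),
    ("table", 0.8, (3, 3, 6, 6)),
    ("lamp", 0.95, (0, 0, 6, 6)),  # covers the whole image, so low IoU with each mask
]
labels = lookup_labels(mask2pixel, detections, num_masks=3)
print(labels)
```

Because the 2D detector only supplies `detections`, swapping it for another model would leave the lookup code unchanged in this sketch, which mirrors the detector-agnostic design described above. A full pipeline would also need to aggregate votes across the multiple synthetic views, which is omitted here.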