With the emergence of LLMs and their integration with other data modalities, multi-modal 3D perception attracts more attention due to its connectivity to the physical world and makes rapid progress. However, limited by existing datasets, previous works mainly focus on understanding object properties or inter-object spatial relationships in a 3D scene. To tackle this problem, this paper builds the first largest ever multi-modal 3D scene dataset and benchmark with hierarchical grounded language annotations, MMScan. It is constructed based on a top-down logic, from region to object level, from a single target to inter-target relationships, covering holistic aspects of spatial and attribute understanding. The overall pipeline incorporates powerful VLMs via carefully designed prompts to initialize the annotations efficiently and further involve humans' correction in the loop to ensure the annotations are natural, correct, and comprehensive. Built upon existing 3D scanning data, the resulting multi-modal 3D dataset encompasses 1.4M meta-annotated captions on 109k objects and 7.7k regions as well as over 3.04M diverse samples for 3D visual grounding and question-answering benchmarks. We evaluate representative baselines on our benchmarks, analyze their capabilities in different aspects, and showcase the key problems to be addressed in the future. Furthermore, we use this high-quality dataset to train state-of-the-art 3D visual grounding and LLMs and obtain remarkable performance improvement both on existing benchmarks and in-the-wild evaluation. Codes, datasets, and benchmarks will be available at https://github.com/OpenRobotLab/EmbodiedScan.
翻译:随着大语言模型的出现及其与其他数据模态的融合,多模态三维感知因其与物理世界的连通性而受到更多关注并取得快速进展。然而,受限于现有数据集,先前工作主要集中于理解三维场景中的物体属性或物体间的空间关系。为解决此问题,本文构建了首个规模最大的、具有层次化基础语言标注的多模态三维场景数据集与基准测试集——MMScan。该数据集采用自上而下的逻辑构建,从区域级别到物体级别,从单一目标到目标间关系,全面覆盖空间与属性理解的各个方面。整体流程通过精心设计的提示词,结合强大的视觉语言模型高效初始化标注,并进一步引入人工校正环节,确保标注的自然性、正确性与完备性。基于现有三维扫描数据构建的最终多模态三维数据集,包含对10.9万个物体和7.7千个区域的140万条元标注描述,以及超过304万个多样化的三维视觉定位与问答基准测试样本。我们在基准测试集上评估了代表性基线模型,分析了它们在不同方面的能力,并揭示了未来需解决的关键问题。此外,我们利用这一高质量数据集训练了最先进的三维视觉定位模型与大语言模型,在现有基准测试集和开放环境评估中均取得了显著的性能提升。代码、数据集与基准测试集将在 https://github.com/OpenRobotLab/EmbodiedScan 发布。