Recent advances in multimodal large language models (LLMs) have shown their potential across many domains, especially in concept reasoning. Despite this progress, their application to understanding 3D environments remains limited. This paper introduces Reason3D, a novel LLM designed for comprehensive 3D understanding. Reason3D takes point cloud data and text prompts as input and produces textual responses and segmentation masks, enabling advanced tasks such as 3D reasoning segmentation, hierarchical searching, referring expression segmentation, and question answering with detailed mask outputs. Specifically, we propose a hierarchical mask decoder to locate small objects within expansive scenes. The decoder first generates a coarse location estimate covering the object's general area. This estimate then guides a coarse-to-fine segmentation strategy that significantly improves the precision of object identification and segmentation. Experiments show that Reason3D achieves strong results on the large-scale ScanNet and Matterport3D datasets for 3D referring expression segmentation, 3D question answering, and 3D reasoning segmentation. Code and models are available at: https://github.com/KuanchihHuang/Reason3D.
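The coarse-to-fine idea described in the abstract can be illustrated with a minimal NumPy sketch. This is not the Reason3D decoder: the function name `coarse_to_fine_mask`, the per-region and per-point scores, and both thresholds are hypothetical stand-ins for whatever a model would predict. The sketch only shows how a coarse region estimate can restrict a finer per-point decision.

```python
import numpy as np

def coarse_to_fine_mask(region_ids, region_scores, point_scores,
                        coarse_thresh=0.5, fine_thresh=0.5):
    """Toy two-stage masking (illustrative assumption, not the paper's method).

    region_ids:    (N,) index of the coarse region each point belongs to
    region_scores: (R,) hypothetical score that a region contains the object
    point_scores:  (N,) hypothetical per-point score for the object
    """
    region_ids = np.asarray(region_ids)
    region_scores = np.asarray(region_scores, dtype=float)
    point_scores = np.asarray(point_scores, dtype=float)

    # Stage 1 (coarse): keep the regions likely to contain the object,
    # then broadcast that decision to every point in those regions.
    keep_region = region_scores >= coarse_thresh
    coarse_mask = keep_region[region_ids]

    # Stage 2 (fine): accept a point only if it lies inside the coarse
    # area and its own score is high enough.
    fine_mask = coarse_mask & (point_scores >= fine_thresh)
    return coarse_mask, fine_mask

# Six points grouped into three regions; only region 1 scores highly.
region_ids = np.array([0, 0, 1, 1, 2, 2])
region_scores = np.array([0.1, 0.9, 0.2])
point_scores = np.array([0.8, 0.3, 0.7, 0.4, 0.9, 0.1])

coarse, fine = coarse_to_fine_mask(region_ids, region_scores, point_scores)
print(coarse)
print(fine)
```

In this toy input, the points at indices 0 and 4 have high per-point scores but fall outside the coarse area, so the fine stage rejects them. That filtering is the kind of benefit a coarse location estimate could offer when the target is a small object in a large scene.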