Recent developments in multimodal learning have greatly advanced 3D scene understanding across various real-world tasks such as embodied AI. However, most existing work shares two typical constraints: 1) it lacks the reasoning ability needed to interact with and interpret human intention, and 2) it focuses on scenarios with single-category objects only, which leads to over-simplified textual descriptions because multi-object scenarios and the spatial relations among objects are neglected. We bridge these research gaps by proposing a 3D reasoning segmentation task for multiple objects in scenes. The task produces 3D segmentation masks together with detailed textual explanations enriched by 3D spatial relations among objects. To this end, we create ReasonSeg3D, a large-scale, high-quality benchmark that integrates 3D spatial relations with generated question-answer pairs and 3D segmentation masks. In addition, we design MORE3D, a simple yet effective method that enables multi-object 3D reasoning segmentation from user questions, producing textual explanations as output. Extensive experiments show that MORE3D excels at reasoning about and segmenting complex multi-object 3D scenes, and that ReasonSeg3D offers a valuable platform for future exploration of 3D reasoning segmentation. The dataset and code will be released.