Reasoning3D -- Grounding and Reasoning in 3D: Fine-Grained Zero-Shot Open-Vocabulary 3D Reasoning Part Segmentation via Large Vision-Language Models

In this paper, we introduce a new task, Zero-Shot 3D Reasoning Segmentation, for searching for and localizing parts of objects. It is a new paradigm for 3D segmentation that transcends the limitations of previous category-specific 3D semantic segmentation, 3D instance segmentation, and open-vocabulary 3D segmentation. We design a simple baseline method, Reasoning3D, that can understand and execute complex commands for (fine-grained) segmentation of specific parts of 3D meshes, with contextual awareness and reasoned answers for interactive segmentation. Specifically, Reasoning3D leverages an off-the-shelf pre-trained 2D segmentation network, powered by Large Language Models (LLMs), to interpret user input queries in a zero-shot manner. Previous research has shown that extensive pre-training endows foundation models with prior world knowledge, enabling them to comprehend complex commands, a capability we can harness to "segment anything" in 3D with limited 3D datasets (resource-efficient). Experiments show that our approach generalizes and can effectively localize and highlight parts of 3D objects (as 3D meshes) based on implicit textual queries, including articulated 3D objects and real-world scanned data. Our method can also generate natural language explanations corresponding to these 3D models and their decomposition. Moreover, our training-free approach allows rapid deployment and serves as a viable universal baseline for future research on part-level 3D (semantic) object understanding in various fields, including robotics, object manipulation, part assembly, autonomous driving, augmented reality and virtual reality (AR/VR), and medical applications. The code, model weights, deployment guide, and evaluation protocol are available at: http://tianrun-chen.github.io/Reason3D/

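The abstract states that a 2D segmentation network is used to segment parts of a 3D mesh, but it does not spell out how 2D predictions reach the mesh. Purely as an illustration, and not as a description of the Reasoning3D implementation, the sketch below assumes one common strategy: render the mesh from several viewpoints, obtain a binary mask per view from a 2D segmenter, and let each mesh face take a majority vote over the pixels in which it is visible. The function name `lift_masks_to_faces`, the per-pixel face-ID maps, and the `threshold` parameter are all hypothetical; the rendering and 2D segmentation steps are assumed to happen elsewhere.

```python
import numpy as np


def lift_masks_to_faces(face_id_maps, masks, num_faces, threshold=0.5):
    """Select mesh faces by voting over per-view 2D masks (illustrative only).

    face_id_maps: list of (H, W) int arrays; each pixel holds the index of the
                  face visible there, or -1 for background. This is an assumed
                  renderer output, not something the abstract specifies.
    masks:        list of (H, W) boolean arrays, one per view, as a 2D
                  segmenter might return for a text query.
    Returns a boolean array of length num_faces (True = face is selected).
    """
    hits = np.zeros(num_faces, dtype=np.int64)  # pixels of a face inside the mask
    seen = np.zeros(num_faces, dtype=np.int64)  # pixels where the face is visible

    for ids, mask in zip(face_id_maps, masks):
        ids = np.asarray(ids)
        mask = np.asarray(mask, dtype=bool)
        visible = ids >= 0
        seen += np.bincount(ids[visible], minlength=num_faces)
        hits += np.bincount(ids[visible & mask], minlength=num_faces)

    # Fraction of a face's visible pixels that fall inside the mask;
    # faces never seen in any view keep a ratio of 0 and stay unselected.
    ratio = np.divide(hits, seen, out=np.zeros(num_faces), where=seen > 0)
    return ratio > threshold


if __name__ == "__main__":
    # Two toy 2x2 "renders" of a mesh with three faces.
    ids_a = np.array([[0, 0], [1, -1]])
    ids_b = np.array([[0, 1], [1, 2]])
    mask_a = np.array([[True, True], [False, False]])
    mask_b = np.array([[True, False], [True, False]])
    print(lift_masks_to_faces([ids_a, ids_b], [mask_a, mask_b], num_faces=3))
```

Voting on the fraction of visible pixels, instead of on raw hit counts, keeps large faces from dominating and tolerates a few views in which the 2D mask is wrong. Other fusion rules are equally plausible under the same assumptions.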