Recent advancements in 3D perception systems have significantly improved their ability to perform visual recognition tasks such as segmentation. However, these systems still rely heavily on explicit human instructions to identify target objects or categories, and they lack the ability to actively reason about and comprehend implicit user intentions. We introduce a novel segmentation task, reasoning part segmentation for 3D objects, which aims to output a segmentation mask given a complex and implicit textual query about a specific part of a 3D object. To facilitate evaluation and benchmarking, we present a large 3D dataset comprising over 60k instructions paired with corresponding ground-truth part segmentation annotations, curated specifically for reasoning-based 3D part segmentation. We propose a model capable of segmenting parts of 3D objects based on implicit textual queries and of generating natural language explanations in response to 3D object segmentation requests. Experiments show that our method achieves performance competitive with models that use explicit queries, with the additional abilities to identify part concepts, reason about them, and complement them with world knowledge. Our source code, dataset, and trained models are available at https://github.com/AmrinKareem/PARIS3D.