Kestrel: Point Grounding Multimodal LLM for Part-Aware 3D Vision-Language Understanding

While 3D multimodal large language models (MLLMs) have made significant progress, they are restricted to object- and scene-level understanding and struggle to understand 3D spatial structures at the part level. In this paper, we introduce Kestrel, a novel approach that gives 3D MLLMs part-aware understanding, enabling better interpretation and segmentation grounding of 3D objects at the part level. Despite the importance of this capability, there are currently no tasks or datasets that endow and assess it. We therefore propose two novel tasks: (1) Part-Aware Point Grounding, in which the model directly predicts a part-level segmentation mask from a user instruction, and (2) Part-Aware Point Grounded Captioning, in which the model produces a detailed caption that includes part-level descriptions and their corresponding masks. To support training and evaluation on these tasks, we introduce the 3DCoMPaT Grounded Instructions Dataset (3DCoMPaT-GRIN). 3DCoMPaT-GRIN Vanilla, comprising 789k part-aware point cloud-instruction-segmentation mask triplets, is used to evaluate MLLMs' part-aware segmentation grounding ability. 3DCoMPaT-GRIN Grounded Caption, containing 107k part-aware point cloud-instruction-grounded caption triplets, assesses both the part-aware language comprehension and the segmentation grounding capabilities of MLLMs. Our tasks, dataset, and Kestrel represent a preliminary effort to bridge the gap between human cognition and 3D MLLMs, i.e., the ability to perceive and engage with the environment at both the global and part levels. Extensive experiments on 3DCoMPaT-GRIN show that Kestrel can generate user-specified segmentation masks, a capability not present in any existing 3D MLLM. Kestrel thus establishes a benchmark for evaluating part-aware language comprehension and segmentation grounding of 3D objects. Project page: https://feielysia.github.io/Kestrel.github.io/
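To make the Part-Aware Point Grounding setup concrete, the following is a minimal sketch of what one point cloud-instruction-segmentation mask triplet might look like, and of how a predicted mask could be compared with the reference mask. It is an illustration only: the class name `GroundingSample`, its field names, the example instruction, and the choice of intersection-over-union (IoU) as the score are all assumptions and are not taken from 3DCoMPaT-GRIN or from Kestrel's evaluation protocol.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class GroundingSample:
    """Hypothetical container for one (point cloud, instruction, mask) triplet."""
    points: List[Tuple[float, float, float]]  # N points as (x, y, z)
    instruction: str                          # natural-language request naming a part
    mask: List[bool]                          # N flags: True if the point belongs to the part


def mask_iou(pred: Sequence[bool], target: Sequence[bool]) -> float:
    """Intersection-over-union between two per-point binary masks.

    IoU is used here as an assumed, commonly used segmentation score.
    """
    if len(pred) != len(target):
        raise ValueError("masks must cover the same number of points")
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    union = sum(1 for p, t in zip(pred, target) if p or t)
    # Two empty masks agree perfectly, so define their IoU as 1.0.
    return 1.0 if union == 0 else intersection / union


# Toy example: six points, the first three belong to the requested part.
sample = GroundingSample(
    points=[(0.0, 0.0, float(i)) for i in range(6)],
    instruction="Segment the backrest of the chair.",  # made-up instruction
    mask=[True, True, True, False, False, False],
)

# A made-up prediction that gets two of the three part points and one false positive.
pred = [True, True, False, True, False, False]
iou = mask_iou(pred, sample.mask)  # intersection = 2, union = 4
print(f"IoU = {iou:.2f}")
```

In the grounded captioning task, a sample would carry a caption with several part mentions, each paired with its own mask; the same per-mask comparison could then be applied to each mention.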
