Recent advances in 3D multimodal large language models (3D-MLLMs) have enabled unified solutions for 3D scene understanding tasks, including visual question answering, captioning, and referring segmentation. However, existing 3D-MLLMs remain largely object-centric, limiting their ability to model fine-grained part structures that are essential for embodied interaction with 3D environments. In this work, we present PAR3D, a unified part-aware 3D-MLLM framework that enables models to understand, reason about, and ground both objects and their parts in 3D scenes. To enable training and evaluation of part-aware 3D scene understanding, we introduce ScenePart, a synthetic 3D scene dataset with part-level annotations and language instructions. We further develop Part-Aware 3D Representation Learning to enrich 3D visual representations with fine-grained part-level semantics, and propose Hierarchical Segmentation Query Generation to ground part targets via hierarchical object-part queries. Extensive experiments show that our method substantially improves part-level question answering and referring segmentation, while also achieving strong performance across object-level vision-language tasks.