3D object segmentation with Large Language Models (LLMs) has become a prevailing paradigm due to its broad semantics, task flexibility, and strong generalization. However, this paradigm is hindered by representation misalignment: LLMs operate on high-level semantic tokens, whereas 3D point clouds convey only dense geometric structures. In prior methods, this misalignment limits both the input and the output stages. At the input stage, dense point patches require heavy pre-alignment, which weakens object-level semantics and confuses similar distractors. At the output stage, predictions depend only on dense features without explicit geometric cues, leading to a loss of fine-grained accuracy. To address these limitations, we present the Point Linguist Model (PLM), a general framework that bridges the representation gap between LLMs and dense 3D point clouds without requiring large-scale 3D–text or 3D–image pre-alignment. Specifically, we introduce the Object-centric Discriminative Representation (OcDR), which learns object-centric tokens that capture target semantics and scene relations under a hard negative-aware training objective. This mitigates the misalignment between LLM tokens and 3D points, enhances resilience to distractors, and facilitates semantic-level reasoning within LLMs. For accurate segmentation, we introduce the Geometric Reactivation Decoder (GRD), which predicts masks by combining OcDR tokens carrying LLM-inferred geometry with the corresponding dense features, preserving comprehensive dense features throughout the pipeline. Extensive experiments show that PLM achieves significant improvements of +7.3 mIoU on ScanNetv2 and +6.0 mIoU on Multi3DRefer for 3D referring segmentation, with consistent gains across 7 benchmarks spanning 4 different tasks, demonstrating the effectiveness of comprehensive object-centric reasoning for robust 3D understanding.
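The abstract does not give the exact form of the hard negative-aware training objective. As a purely illustrative sketch, an objective of this kind can be written as an InfoNCE-style loss whose negative set is the look-alike distractor objects from the same scene, pushing each object-centric token toward its target and away from distractors. All function names, tensor shapes, and the temperature value below are assumptions for illustration, not the paper's specification:

```python
import torch
import torch.nn.functional as F

def hard_negative_infonce(obj_tokens, target_feats, distractor_feats, tau=0.07):
    """Illustrative hard negative-aware contrastive objective (names assumed).

    obj_tokens:       (B, D)    object-centric tokens produced by the model
    target_feats:     (B, D)    features of the ground-truth target objects
    distractor_feats: (B, K, D) features of K same-scene distractors (hard negatives)
    """
    q = F.normalize(obj_tokens, dim=-1)
    p = F.normalize(target_feats, dim=-1)
    n = F.normalize(distractor_feats, dim=-1)
    pos = (q * p).sum(-1, keepdim=True) / tau           # (B, 1) similarity to target
    neg = torch.einsum('bd,bkd->bk', q, n) / tau        # (B, K) similarity to distractors
    logits = torch.cat([pos, neg], dim=1)               # positive sits at index 0
    labels = torch.zeros(q.size(0), dtype=torch.long)
    return F.cross_entropy(logits, labels)
```

Because the negatives are drawn from the same scene rather than random objects, the loss specifically penalizes confusion with similar distractors, matching the resilience claim in the abstract.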
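As an illustrative sketch of the GRD decoding idea only (the class, layer sizes, and names below are assumed, not the paper's architecture), per-point mask logits can be obtained by projecting the LLM-refined object token into the dense feature space and taking a dot product against each preserved point feature:

```python
import torch
import torch.nn as nn

class GeometricReactivationHead(nn.Module):
    """Minimal sketch in the spirit of GRD (all names/sizes illustrative):
    an OcDR token carrying LLM-inferred geometry is matched against the
    preserved dense per-point features to yield per-point mask logits."""

    def __init__(self, token_dim, feat_dim):
        super().__init__()
        self.proj = nn.Linear(token_dim, feat_dim)  # map token into point-feature space

    def forward(self, ocdr_token, dense_feats):
        # ocdr_token: (B, Dt); dense_feats: (B, N, Df)
        q = self.proj(ocdr_token)                            # (B, Df)
        logits = torch.einsum('bd,bnd->bn', q, dense_feats)  # (B, N)
        return logits                                        # mask = logits > 0
```

The key design point this sketch captures is that the dense features flow to the decoder unmodified, so fine-grained geometry is available at mask time rather than being lost inside the LLM.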