3D Multi-modal Large Language Models (MLLMs) still lag behind their 2D counterparts, largely because large-scale, high-quality 3D scene-dialogue datasets remain scarce. Prior efforts hinge on expensive human annotation and leave two key ambiguities unresolved: viewpoint ambiguity, where spatial language presumes unknown camera poses, and object-referring ambiguity, where non-exclusive descriptions blur the line between targets and distractors. We therefore present a fully automated pipeline that converts raw 3D scans into unambiguous, high-quality dialogue data at a fraction of the previous cost. By combining rule-based constraints with 2D MLLMs and LLMs, the pipeline enables controllable, scalable generation without human intervention. It comprises four stages: (1) meta-annotation collection, harvesting object-, frame-, and scene-level captions; (2) scene graph construction with relation correction, capturing proximal object relations; (3) discriminative object referring, generating exclusive and compact descriptions; and (4) multi-task data generation, synthesizing diverse dialogues. Our pipeline systematically mitigates inherent flaws in the source datasets and produces the final Disc3D dataset: over 2 million samples across 25K hybrid 3D scenes, spanning scene, view, and object captioning, visual grounding, and five object-centric QA tasks. Extensive experiments demonstrate that training with Disc3D yields consistent, significant improvements on both public benchmarks and our multifaceted Disc3D-QA tasks. Code, data, and models will be publicly released.
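The four-stage pipeline above can be pictured as a sequence of transformations over a per-scene record. The following is a minimal illustrative sketch, not the authors' implementation: every name, data structure, and stage body here is a hypothetical stand-in (the real stages invoke 2D MLLMs and LLMs rather than string templates).

```python
from dataclasses import dataclass, field

# Hypothetical sketch of the four-stage Disc3D pipeline. All names and
# data structures are illustrative assumptions, not the actual system.

@dataclass
class Scene:
    scan_id: str
    objects: list
    meta_annotations: dict = field(default_factory=dict)
    scene_graph: dict = field(default_factory=dict)
    referrals: dict = field(default_factory=dict)
    dialogues: list = field(default_factory=list)

def collect_meta_annotations(scene):
    # Stage 1: harvest object-, frame-, and scene-level captions
    # (in the paper, produced by 2D MLLMs; a template stands in here).
    scene.meta_annotations = {o: f"caption for {o}" for o in scene.objects}
    return scene

def build_scene_graph(scene):
    # Stage 2: connect proximal objects and correct their relations
    # with rule-based constraints (here every pair is marked "near").
    pairs = [(a, b) for i, a in enumerate(scene.objects)
             for b in scene.objects[i + 1:]]
    scene.scene_graph = {p: "near" for p in pairs}
    return scene

def discriminative_referring(scene):
    # Stage 3: generate exclusive, compact referring expressions that
    # single out each target against its distractors.
    scene.referrals = {o: f"the {o} (uniquely identified)" for o in scene.objects}
    return scene

def generate_dialogues(scene):
    # Stage 4: synthesize multi-task samples (captioning, grounding,
    # object-centric QA) from the accumulated annotations.
    scene.dialogues = [{"task": "grounding", "query": ref, "answer": o}
                       for o, ref in scene.referrals.items()]
    return scene

def run_pipeline(scene):
    for stage in (collect_meta_annotations, build_scene_graph,
                  discriminative_referring, generate_dialogues):
        scene = stage(scene)
    return scene

scene = run_pipeline(Scene("scan_0001", ["chair", "table"]))
print(len(scene.dialogues))  # one grounding sample per object
```

The chained-stage design mirrors the abstract's claim that each stage consumes the previous stage's annotations, which is what lets rule-based corrections (stage 2) constrain the model-generated text downstream.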