We propose UniSeg3D, a unified 3D scene understanding framework that performs panoptic, semantic, instance, interactive, referring, and open-vocabulary segmentation within a single model. Most previous 3D segmentation approaches are tailored to a specific task, limiting their understanding of 3D scenes to a task-specific perspective. In contrast, the proposed method unifies six tasks into shared representations processed by a single Transformer, facilitating inter-task knowledge sharing and thereby promoting comprehensive 3D scene understanding. To take full advantage of multi-task unification, we further enhance performance by establishing explicit inter-task associations: specifically, we design knowledge distillation and contrastive learning methods to transfer task-specific knowledge across tasks. Experiments on three benchmarks, ScanNet20, ScanRefer, and ScanNet200, demonstrate that UniSeg3D consistently outperforms current SOTA methods, even those specialized for individual tasks. We hope UniSeg3D can serve as a solid unified baseline and inspire future work. Code and models are available at https://github.com/dk-liang/UniSeg3D.