State-of-the-art models on contemporary 3D segmentation benchmarks like ScanNet consume and label dataset-provided 3D point clouds, obtained through post-processing of sensed multiview RGB-D images. They are typically trained in-domain, forgo large-scale 2D pre-training, and outperform alternatives that featurize the posed RGB-D multiview images instead. The gap in performance between methods that consume posed images versus post-processed 3D point clouds has fueled the belief that 2D and 3D perception require distinct model architectures. In this paper, we challenge this view and propose ODIN (Omni-Dimensional INstance segmentation), a model that can segment and label both 2D RGB images and 3D point clouds, using a transformer architecture that alternates between 2D within-view and 3D cross-view information fusion. Our model differentiates 2D and 3D feature operations through the positional encodings of the tokens involved, which capture pixel coordinates for 2D patch tokens and 3D coordinates for 3D feature tokens. ODIN achieves state-of-the-art performance on the ScanNet200, Matterport3D, and AI2THOR 3D instance segmentation benchmarks, and competitive performance on ScanNet, S3DIS, and COCO. It outperforms all previous works by a wide margin when the sensed 3D point cloud is used in place of the point cloud sampled from the 3D mesh. When used as the 3D perception engine in an instructable embodied agent architecture, it sets a new state-of-the-art on the TEACh action-from-dialogue benchmark. Our code and checkpoints can be found at the project website (https://odin-seg.github.io).
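The alternating 2D/3D fusion described above can be illustrated with a minimal sketch: within-view attention adds positional encodings derived from pixel coordinates, while cross-view attention pools all views' tokens together and adds encodings derived from their 3D coordinates. This is an illustrative assumption of the general scheme, not the actual ODIN implementation; the function names and the Fourier-feature positional encoding are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # single-head scaled dot-product attention
    d = q.shape[-1]
    w = softmax(q @ k.swapaxes(-1, -2) / np.sqrt(d))
    return w @ v

def fourier_pe(coords, dim):
    # Fourier features of coordinates (2D pixel xy or 3D world xyz);
    # dim must be divisible by 2 * coords.shape[-1]
    freqs = 2.0 ** np.arange(dim // (2 * coords.shape[-1]))
    ang = coords[..., None] * freqs                       # (..., C, F)
    pe = np.concatenate([np.sin(ang), np.cos(ang)], -1)   # (..., C, 2F)
    return pe.reshape(*coords.shape[:-1], -1)             # (..., dim)

def odin_block(feats, pix_xy, world_xyz):
    # feats: (V, N, D) patch tokens from V posed views, N patches each
    # pix_xy: (V, N, 2) pixel coords; world_xyz: (V, N, 3) unprojected coords
    V, N, D = feats.shape
    # 2D stage: attention WITHIN each view, tokens carry 2D pixel PEs
    x2 = feats + fourier_pe(pix_xy, D)
    feats = np.stack([attention(x2[v], x2[v], feats[v]) for v in range(V)])
    # 3D stage: attention ACROSS all views' tokens, with 3D coordinate PEs
    flat = feats.reshape(V * N, D)
    x3 = flat + fourier_pe(world_xyz.reshape(V * N, 3), D)
    return attention(x3, x3, flat).reshape(V, N, D)
```

The key design point the sketch captures is that the 2D and 3D stages share the same attention operation; only the positional encodings (and the set of tokens attending to each other) distinguish them, which is what lets one backbone serve both single-image and multiview 3D inputs.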