State-of-the-art models on contemporary 3D segmentation benchmarks like ScanNet consume and label dataset-provided 3D point clouds, obtained through post-processing of sensed multiview RGB-D images. They are typically trained in-domain, forgo large-scale 2D pre-training, and outperform alternatives that featurize the posed RGB-D multiview images instead. The gap in performance between methods that consume posed images and those that consume post-processed 3D point clouds has fueled the belief that 2D and 3D perception require distinct model architectures. In this paper, we challenge this view and propose ODIN (Omni-Dimensional INstance segmentation), a model that can segment and label both 2D RGB images and 3D point clouds, using a transformer architecture that alternates between 2D within-view and 3D cross-view information fusion. Our model differentiates 2D and 3D feature operations through the positional encodings of the tokens involved, which capture pixel coordinates for 2D patch tokens and 3D coordinates for 3D feature tokens. ODIN achieves state-of-the-art performance on the ScanNet200, Matterport3D and AI2THOR 3D instance segmentation benchmarks, and competitive performance on ScanNet, S3DIS and COCO. It outperforms all previous works by a wide margin when the sensed 3D point cloud is used in place of the point cloud sampled from the 3D mesh. When used as the 3D perception engine in an instructable embodied agent architecture, it sets a new state of the art on the TEACh action-from-dialogue benchmark. Our code and checkpoints can be found at the project website (https://odin-seg.github.io).
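To make the distinction between the two kinds of positional information concrete, the following is a minimal sketch of how a patch token from a posed RGB-D view could carry both a pixel coordinate (for 2D within-view fusion) and a 3D world coordinate (for 3D cross-view fusion). It assumes a standard pinhole camera model with known intrinsics and a camera-to-world pose; the function names `unproject_pixel` and `token_positions` and the dictionary layout are hypothetical illustrations, not taken from the ODIN codebase.

```python
from typing import Dict, List, Sequence, Tuple

Matrix4 = Sequence[Sequence[float]]  # 4x4 camera-to-world pose, row-major


def unproject_pixel(u: float, v: float, depth: float,
                    fx: float, fy: float, cx: float, cy: float,
                    cam_to_world: Matrix4) -> Tuple[float, float, float]:
    """Lift pixel (u, v) with metric depth to world coordinates (pinhole model)."""
    # Back-project into the camera frame.
    x_cam = (u - cx) * depth / fx
    y_cam = (v - cy) * depth / fy
    point = (x_cam, y_cam, depth, 1.0)  # homogeneous coordinates
    # Apply the rigid camera-to-world transform (first three rows only).
    x, y, z = (sum(cam_to_world[r][c] * point[c] for c in range(4))
               for r in range(3))
    return (x, y, z)


def token_positions(pixels: Sequence[Tuple[float, float]],
                    depths: Sequence[float],
                    intrinsics: Tuple[float, float, float, float],
                    cam_to_world: Matrix4) -> List[Dict[str, tuple]]:
    """Pair each patch centre's 2D pixel position with its unprojected 3D position.

    Hypothetical layout: 'pos_2d' would feed a 2D positional encoding,
    'pos_3d' a 3D positional encoding.
    """
    fx, fy, cx, cy = intrinsics
    return [
        {"pos_2d": (u, v),
         "pos_3d": unproject_pixel(u, v, d, fx, fy, cx, cy, cam_to_world)}
        for (u, v), d in zip(pixels, depths)
    ]
```

Under this assumption, tokens from different views that observe the same surface point end up with nearby `pos_3d` values even though their `pos_2d` values differ, which is what lets a cross-view layer relate them; an RGB-only image without depth would simply use `pos_2d`.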