We tackle the problem of learning an implicit scene representation for 3D instance segmentation from a sequence of posed RGB images. To this end, we introduce 3DIML, a novel framework that efficiently learns a neural label field capable of rendering 3D instance segmentation masks from novel viewpoints. In contrast to prior art that optimizes a neural field in a self-supervised manner, requiring complicated training procedures and loss function design, 3DIML leverages a two-phase process. The first phase, InstanceMap, takes as input 2D segmentation masks of the image sequence generated by a frontend instance segmentation model and associates corresponding masks across images with 3D labels. These nearly 3D-consistent pseudolabel masks are then used in the second phase, InstanceLift, to supervise the training of a neural label field, which interpolates regions missed by InstanceMap and resolves ambiguities. Additionally, we introduce InstanceLoc, which enables near-realtime localization of instance masks given a trained neural label field. We evaluate 3DIML on sequences from the Replica and ScanNet datasets and demonstrate its effectiveness under mild assumptions on the image sequences. We achieve a large practical speedup over existing implicit scene representation methods of comparable quality, showcasing 3DIML's potential to facilitate faster and more effective 3D scene understanding.
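To make the cross-frame association idea concrete, the following is a minimal sketch of greedy mask linking between consecutive frames using IoU overlap. This is an illustrative assumption, not the paper's actual InstanceMap procedure (which operates on posed images and may use geometric correspondences); the function `associate_masks` and the IoU threshold are hypothetical.

```python
import numpy as np

def iou(a, b):
    """Intersection-over-union of two boolean masks."""
    union = np.logical_or(a, b).sum()
    return np.logical_and(a, b).sum() / union if union else 0.0

def associate_masks(frames, iou_thresh=0.5):
    """Greedily link per-frame 2D instance masks into globally
    consistent integer labels by matching each mask to the
    best-overlapping mask in the previous frame.

    frames: list of frames, each a list of boolean np.ndarray masks.
    Returns one list of labels per frame (a simplified stand-in for
    3D-consistent pseudolabels)."""
    next_label = 0
    labeled, prev = [], []  # prev: (mask, label) pairs from last frame
    for masks in frames:
        cur, used = [], set()
        for m in masks:
            best, best_iou = None, iou_thresh
            for pm, pl in prev:
                if pl in used:
                    continue  # each previous instance matches at most once
                s = iou(m, pm)
                if s > best_iou:
                    best, best_iou = pl, s
            if best is None:  # no match: start a new instance label
                best = next_label
                next_label += 1
            used.add(best)
            cur.append((m, best))
        labeled.append([lbl for _, lbl in cur])
        prev = cur
    return labeled
```

Under this simplification, a mask that persists across frames keeps one label, while an unmatched mask spawns a new one; the neural label field would then be supervised on these per-pixel labels.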