Scene understanding has made tremendous progress over the past few years, as data acquisition systems now provide an increasing amount of data in various modalities (point clouds, depth, RGB...). However, this progress comes at a large cost in computational resources and data annotation requirements. To analyze geometric information and images jointly, many approaches rely on both a 2D loss and a 3D loss, requiring not only per-pixel 2D labels but also per-point 3D labels. Obtaining 3D ground truth, however, is challenging, time-consuming and error-prone. In this paper, we show that image segmentation can benefit from 3D geometric information without requiring any 3D ground truth, by training the geometric feature extraction and the 2D segmentation network jointly, in an end-to-end fashion, using only the 2D segmentation loss. Our method starts by extracting a map of 3D features directly from a provided point cloud using a lightweight 3D neural network. The 3D feature map, merged with the RGB image, is then used as input to a classical image segmentation network. Our method can be applied to many 2D segmentation networks, significantly improving their performance with only a marginal increase in the number of network weights and light input dataset requirements, since no 3D ground truth is required.
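As a rough illustration of the data flow described above, the following minimal sketch shows one way a map of per-point features could be aligned with an image and merged with its RGB channels. It is not the paper's implementation: the pinhole projection, the nearest-point (z-buffer) rule, the merge by channel concatenation, and the names `project_point_features` and `merge_with_rgb` are all assumptions made for the example. The per-point features are placeholders here; in the method they would come from the lightweight 3D network, which is trained jointly with the segmentation network through the 2D loss alone.

```python
# Hypothetical sketch: scatter per-point features onto the image plane,
# then stack them with RGB as extra input channels for a 2D network.
# Assumes points are already expressed in the camera frame (z forward).

def project_point_features(points, feats, fx, fy, cx, cy, height, width):
    """Build an H x W x C feature map from per-point feature vectors.

    points: list of (x, y, z) coordinates in the camera frame.
    feats:  list of per-point feature vectors, all of length C.
    If several points fall on one pixel, the nearest one is kept.
    Pixels that receive no point keep an all-zero feature vector.
    """
    n_ch = len(feats[0])
    fmap = [[[0.0] * n_ch for _ in range(width)] for _ in range(height)]
    depth = [[float("inf")] * width for _ in range(height)]
    for (x, y, z), f in zip(points, feats):
        if z <= 0:  # point is behind the camera
            continue
        u = int(round(fx * x / z + cx))  # pixel column
        v = int(round(fy * y / z + cy))  # pixel row
        if 0 <= u < width and 0 <= v < height and z < depth[v][u]:
            depth[v][u] = z
            fmap[v][u] = list(f)
    return fmap


def merge_with_rgb(rgb, fmap):
    """Concatenate RGB and feature channels per pixel: H x W x (3 + C)."""
    return [
        [list(rgb_px) + list(feat_px) for rgb_px, feat_px in zip(rgb_row, feat_row)]
        for rgb_row, feat_row in zip(rgb, fmap)
    ]


# Toy usage: a 3 x 3 grey image and four points with 2-channel features.
points = [(0.0, 0.0, 1.0), (0.0, 0.0, 2.0), (1.0, 0.0, 1.0), (0.0, 0.0, -1.0)]
feats = [[1.0, 2.0], [9.0, 9.0], [3.0, 4.0], [5.0, 5.0]]
rgb = [[[0.5, 0.5, 0.5] for _ in range(3)] for _ in range(3)]

fmap = project_point_features(points, feats, 1.0, 1.0, 1.0, 1.0, 3, 3)
merged = merge_with_rgb(rgb, fmap)
```

The merged array has five channels per pixel and could be fed to any 2D segmentation network whose first layer accepts the extra channels. In an end-to-end setting the same steps would be written in an automatic-differentiation framework, so that the gradient of the 2D segmentation loss reaches the 3D feature extractor.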