Collecting diverse sets of training images for RGB-D semantic image segmentation is not always possible. In particular, when robots need to operate in privacy-sensitive areas like homes, data collection is often limited to a small set of locations. As a consequence, the annotated images lack diversity in appearance, and approaches for RGB-D semantic image segmentation tend to overfit the training data. To address this problem, we introduce semantic RGB-D image synthesis: the task of synthesising a realistic-looking RGB-D image for a given semantic label map. Current synthesis approaches, however, are uni-modal and cannot cope with multi-modal data. Indeed, we show that extending uni-modal approaches to multi-modal data does not perform well. We therefore propose a generator for multi-modal data that separates the modality-independent information of the semantic layout from the modality-dependent information that is needed to generate an RGB image and a depth image, respectively. Furthermore, we propose a discriminator that ensures semantic consistency between the label maps and the generated images, as well as perceptual similarity between the real and generated images. Our comprehensive experiments demonstrate that the proposed method outperforms previous uni-modal methods by a large margin and that the accuracy of an approach for RGB-D semantic segmentation can be significantly improved by mixing real and generated images during training.
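
As an illustration of the final point, mixing real and generated images during training, a minimal sketch in Python might look as follows. The abstract does not specify how the two sources are combined, so the function name `mix_real_and_generated`, the `generated_fraction` parameter, and the fixed random seed are assumptions made purely for illustration, not the procedure used in the experiments.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def mix_real_and_generated(
    real: Sequence[T],
    generated: Sequence[T],
    generated_fraction: float,
    seed: int = 0,
) -> List[Tuple[T, str]]:
    """Combine all real samples with a subset of generated samples.

    Hypothetical helper: `generated_fraction` is the target share of
    generated samples in the mixed set (an assumed knob, not a value
    taken from the paper).
    """
    if not 0.0 <= generated_fraction < 1.0:
        raise ValueError("generated_fraction must be in [0, 1)")

    rng = random.Random(seed)

    # Solve n_gen / (n_real + n_gen) = generated_fraction for n_gen,
    # then cap it by the number of generated samples actually available.
    n_generated = round(len(real) * generated_fraction / (1.0 - generated_fraction))
    n_generated = min(n_generated, len(generated))

    chosen = rng.sample(list(generated), n_generated)

    # Tag each sample with its origin so it can be tracked later.
    mixed = [(sample, "real") for sample in real]
    mixed += [(sample, "generated") for sample in chosen]
    rng.shuffle(mixed)
    return mixed


if __name__ == "__main__":
    real_ids = [f"real_{i}" for i in range(6)]
    generated_ids = [f"gen_{i}" for i in range(10)]
    for sample, origin in mix_real_and_generated(real_ids, generated_ids, 0.25):
        print(origin, sample)
```

Here each sample stands in for an (RGB, depth, label map) triple; in practice the identifiers would be replaced by whatever dataset entries the segmentation network consumes.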