Semantic image synthesis (SIS) is the problem of generating realistic imagery from a semantic segmentation mask that defines the spatial layout of object classes. Beyond image quality, most approaches in the literature focus on increasing generation diversity in terms of style, i.e. texture. However, they all neglect a different capability: manipulating the layout provided by the mask. Currently, the only way to do so is manually, by means of graphical user interfaces. In this paper, we describe a network architecture that addresses the problem of automatically manipulating or generating the shape of object classes in semantic segmentation masks, with a specific focus on human faces. Our proposed model embeds the mask class-wise into a latent space where each class embedding can be edited independently. A bi-directional LSTM block and a convolutional decoder then output a new, locally manipulated mask. We report quantitative and qualitative results on the CelebAMask-HQ dataset, which show that our model can both faithfully reconstruct and modify a segmentation mask at the class level. We also show that our model can be placed before an SIS generator, opening the way to fully automatic generation control of both shape and texture. Code is available at https://github.com/TFonta/Semantic-VAE.
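To make the idea of class-wise editing concrete, the following minimal sketch splits a label mask into one binary channel per class, replaces a single class channel with the one from a second mask, and collapses the result back into a label mask. It is an illustration only, not the authors' implementation: it edits binary channels directly, whereas the model described above edits learned per-class embeddings and relies on the bi-directional LSTM and convolutional decoder to produce a coherent mask. The function names, the toy 4x4 masks, the three-class labelling (0 = background, 1 = face, 2 = mouth), and the overlap rule are all assumptions made for this example.

```python
import numpy as np


def to_class_channels(mask, num_classes):
    """Split an (H, W) integer label mask into (num_classes, H, W) binary channels."""
    class_ids = np.arange(num_classes)[:, None, None]
    return (class_ids == mask[None]).astype(np.float32)


def swap_class(channels_a, channels_b, class_id):
    """Copy channels_a, taking the channel for class_id from channels_b instead."""
    edited = channels_a.copy()
    edited[class_id] = channels_b[class_id]
    return edited


def to_label_mask(channels, edited_class=None):
    """Collapse binary channels back into an (H, W) label mask.

    Assumed conflict rule for this sketch: where the edited class overlaps
    another class, the edited class wins. Pixels claimed by no class fall
    back to label 0 (background), because argmax returns the first index.
    """
    scores = channels.copy()
    if edited_class is not None:
        scores[edited_class] *= 2.0
    return scores.argmax(axis=0)


# Toy masks (hypothetical labels: 0 = background, 1 = face, 2 = mouth).
mask_a = np.array([
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 2, 1, 0],
    [0, 0, 0, 0],
])
mask_b = np.array([
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 2, 2, 0],
    [0, 0, 0, 0],
])

channels_a = to_class_channels(mask_a, num_classes=3)
channels_b = to_class_channels(mask_b, num_classes=3)

# Keep the face shape of mask_a but take the mouth shape from mask_b.
edited_channels = swap_class(channels_a, channels_b, class_id=2)
edited_mask = to_label_mask(edited_channels, edited_class=2)
print(edited_mask)
```

In this toy case the edited mask keeps the taller face region of the first mask and the wider mouth of the second. The hand-written overlap rule is the part a learned decoder would replace, since naive channel swapping can leave holes or conflicts between neighbouring classes.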