In semantic image synthesis, the state of the art is dominated by methods that use customized variants of SPatially-Adaptive DE-normalization (SPADE) layers, which allow for good visual generation quality and editing versatility. By design, such layers learn pixel-wise modulation parameters to de-normalize the generator activations based on the semantic class each pixel belongs to. As a result, they tend to overlook global image statistics, which leads to unconvincing local style editing and to global inconsistencies such as shifts in color or illumination distribution. Moreover, SPADE layers require the semantic segmentation mask to map styles in the generator, which prevents shape manipulations without manual intervention. In response, we design a novel architecture in which cross-attention layers replace SPADE to learn shape-style correlations and thereby condition the image generation process. Our model inherits the versatility of SPADE while obtaining state-of-the-art generation quality, as well as improved global and local style transfer. Code and models are available at https://github.com/TFonta/CA2SIS.
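
To make the contrast concrete, the following is a minimal NumPy sketch, not the released implementation. It assumes a simplified SPADE-like layer in which the modulation parameters come from a per-class lookup table (actual SPADE layers predict them with convolutions over the mask), and a generic single-head cross-attention in which pixel features query one style code per class. All function names, shapes, and weight matrices are hypothetical and chosen only for illustration.

```python
import numpy as np


def spade_modulate(x, mask, gamma, beta, eps=1e-5):
    """Simplified SPADE-like de-normalization (assumption: per-class lookup).

    x:     (C, H, W) generator activations
    mask:  (H, W) integer class label per pixel
    gamma: (K, C) per-class scale, beta: (K, C) per-class shift
    """
    # Normalize each channel over the spatial dimensions.
    mean = x.mean(axis=(1, 2), keepdims=True)
    std = x.std(axis=(1, 2), keepdims=True)
    x_norm = (x - mean) / (std + eps)
    # Each pixel picks its parameters from its own class only,
    # so the mask is required and the choice is hard, not learned.
    g = gamma[mask].transpose(2, 0, 1)  # (H, W, C) -> (C, H, W)
    b = beta[mask].transpose(2, 0, 1)
    return x_norm * (1.0 + g) + b


def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(x, styles, w_q, w_k, w_v):
    """Generic single-head cross-attention (hypothetical layer layout).

    x:      (C, H, W) generator activations, used as queries
    styles: (K, D) one style code per semantic class, used as keys/values
    w_q: (C, A), w_k: (D, A), w_v: (D, C) projection matrices
    Returns the residual-updated activations and the (H, W, K) attention map.
    """
    C, H, W = x.shape
    q = x.reshape(C, H * W).T @ w_q  # (HW, A)
    k = styles @ w_k                 # (K, A)
    v = styles @ w_v                 # (K, C)
    # Every pixel attends softly to every class style code.
    attn = softmax(q @ k.T / np.sqrt(k.shape[1]), axis=-1)  # (HW, K)
    out = attn @ v                                          # (HW, C)
    return x + out.T.reshape(C, H, W), attn.reshape(H, W, -1)
```

In the first function each pixel sees only the parameters of its own class, whereas in the second every pixel receives a learned soft mixture over all class styles, which is one way a layer can take global style information into account without a hard mask lookup.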