A promise of Generative Adversarial Networks (GANs) is to provide cheap photorealistic data for training and validating AI models in autonomous driving. Despite their huge success, their performance on complex images featuring multiple objects is understudied. While some frameworks produce high-quality street scenes with little to no control over the image content, others offer more control at the expense of high-quality generation. A common limitation of both approaches is the use of global latent codes for the whole image, which hinders the learning of independent object distributions. Motivated by SemanticStyleGAN (SSG), a recent work on latent space disentanglement in human face generation, we propose a novel framework, Urban-StyleGAN, for urban scene generation and manipulation. We find that a straightforward application of SSG leads to poor results because urban scenes are more complex than human faces. To provide a more compact yet disentangled latent representation, we develop a class grouping strategy wherein individual classes are grouped into super-classes. Moreover, we employ an unsupervised latent exploration algorithm in the $\mathcal{S}$-space of the generator and show that it is more efficient than the conventional $\mathcal{W}^{+}$-space in controlling the image content. Results on the Cityscapes and Mapillary datasets show the proposed approach achieves significantly more controllability and improved image quality than previous approaches on urban scenes and is on par with general-purpose non-controllable generative models (like StyleGAN2) in terms of quality.
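To make the class grouping idea concrete, the following is a minimal sketch of how per-pixel class labels could be merged into super-class labels before training. It is an illustration under stated assumptions, not the grouping used in Urban-StyleGAN: the abstract does not list the super-classes, so the sketch borrows the category groups defined by the Cityscapes dataset as a stand-in. The names `CLASSES`, `SUPER_CLASSES`, `build_lookup`, and `group_labels` are hypothetical.

```python
import numpy as np

# The 19 Cityscapes evaluation classes in train-ID order.
CLASSES = ["road", "sidewalk", "building", "wall", "fence", "pole",
           "traffic light", "traffic sign", "vegetation", "terrain", "sky",
           "person", "rider", "car", "truck", "bus", "train",
           "motorcycle", "bicycle"]

# ASSUMPTION: stand-in grouping taken from the Cityscapes category
# definitions; the grouping used in the paper may differ.
SUPER_CLASSES = {
    "flat": ["road", "sidewalk"],
    "construction": ["building", "wall", "fence"],
    "object": ["pole", "traffic light", "traffic sign"],
    "nature": ["vegetation", "terrain"],
    "sky": ["sky"],
    "human": ["person", "rider"],
    "vehicle": ["car", "truck", "bus", "train", "motorcycle", "bicycle"],
}

IGNORE_ID = 255  # conventional "ignore" label in Cityscapes train-ID maps


def build_lookup(classes, groups):
    """Return a 256-entry table mapping class ID -> super-class ID."""
    lut = np.full(256, IGNORE_ID, dtype=np.uint8)
    for super_id, members in enumerate(groups.values()):
        for name in members:
            lut[classes.index(name)] = super_id
    return lut


def group_labels(label_map, lut):
    """Remap a uint8 label map of class IDs to super-class IDs."""
    return lut[label_map]


LUT = build_lookup(CLASSES, SUPER_CLASSES)

# Tiny 2x2 label map: road, sidewalk, car, and one ignored pixel.
labels = np.array([[0, 1], [13, IGNORE_ID]], dtype=np.uint8)
grouped = group_labels(labels, LUT)
```

With this stand-in grouping, the 19 classes collapse to 7 super-classes, so a generator with one local latent code per class would need 7 codes instead of 19. This is the sense in which grouping yields a more compact representation while keeping semantically distinct regions separate.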