Recent advances in generative adversarial networks (GANs) have achieved great success in automated image composition, which generates new images by automatically embedding foreground objects of interest into background images. However, most existing works deal with foreground objects in two-dimensional (2D) images, even though foreground objects in three-dimensional (3D) models are more flexible, offering 360-degree view freedom. This paper presents an innovative View Alignment GAN (VA-GAN) that composes new images by embedding 3D models into 2D background images realistically and automatically. VA-GAN consists of a texture generator and a differential discriminator that are interconnected and end-to-end trainable. The differential discriminator guides the learning of geometric transformations from background images so that the composed 3D models are aligned with the background images in realistic poses and views. The texture generator adopts a novel view encoding mechanism to generate accurate object textures for the 3D models under the estimated views. Extensive experiments on two synthesis tasks (car synthesis with KITTI and pedestrian synthesis with Cityscapes) show that VA-GAN achieves high-fidelity composition, both qualitatively and quantitatively, compared with state-of-the-art generation methods.