T-Person-GAN: Text-to-Person Image Generation with Identity-Consistency and Manifold Mix-Up

In this paper, we present an end-to-end approach for generating high-resolution person images conditioned on text alone. State-of-the-art text-to-image generation models are mainly designed for center-object generation, e.g., flowers and birds. Unlike center-placed objects with similar shapes and orientations, person images are more challenging to generate, and we observe two requirements: 1) the images generated for the same person should exhibit identity-consistent visual details, e.g., identity-related textures, clothes, and shoes across the images, and 2) those images should be discriminative so that they remain robust against inter-person variations caused by visual ambiguities. To address these challenges, we develop an effective generative model that produces person images with two novel mechanisms. The first mechanism (called T-Person-GAN-ID) integrates the one-stream generator with an identity-preserving network, so that the representations of generated data are regularized in their feature space to ensure identity consistency. The second mechanism (called T-Person-GAN-ID-MM) is based on manifold mix-up: it produces mixed images by linearly interpolating between generated images from the manifolds of different identities, and we further require these interpolated images to be linearly classified in the feature space. This amounts to learning a linear classification boundary that can perfectly separate images from two identities. Our proposed method is empirically validated to achieve a remarkable improvement in text-to-person image generation. Our architecture is orthogonal to StackGAN++ and focuses on person image generation; together, they enrich the spectrum of GANs for the image generation task. Code is available at \url{https://github.com/linwu-github/Person-Image-Generation.git}.
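To make the mix-up idea concrete, the following is a minimal sketch of interpolating between two identities and scoring the mixed sample with a linear classifier. It is not the released implementation: the sketch mixes feature vectors rather than images, and the function names, the toy two-dimensional features, the classifier weights, and the Beta(0.4, 0.4) sampling of the mixing coefficient are all illustrative assumptions.

```python
import math
import random


def mix_features(feat_a, feat_b, lam):
    """Linearly interpolate two feature vectors with coefficient lam in [0, 1]."""
    return [lam * a + (1.0 - lam) * b for a, b in zip(feat_a, feat_b)]


def linear_logits(weights, bias, feat):
    """Linear classifier: one logit per identity class."""
    return [sum(w * x for w, x in zip(row, feat)) + b
            for row, b in zip(weights, bias)]


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]


def mixup_loss(weights, bias, feat_a, id_a, feat_b, id_b, lam):
    """Cross-entropy of the mixed feature against lam-weighted identity labels."""
    mixed = mix_features(feat_a, feat_b, lam)
    log_probs = log_softmax(linear_logits(weights, bias, mixed))
    return -(lam * log_probs[id_a] + (1.0 - lam) * log_probs[id_b])


# Toy features for two identities (hypothetical values, not from the paper).
feat_a, feat_b = [1.0, 0.0], [0.0, 1.0]
weights, bias = [[2.0, -2.0], [-2.0, 2.0]], [0.0, 0.0]

# Assumption: the mixing coefficient is drawn from a Beta distribution,
# as is common in mix-up training; the abstract does not specify this.
lam = random.betavariate(0.4, 0.4)
loss = mixup_loss(weights, bias, feat_a, 0, feat_b, 1, lam)
print(f"lam={lam:.3f}, mix-up loss={loss:.4f}")
```

Minimizing this loss over many identity pairs pushes the classifier to change its prediction linearly along the interpolation path, which is one way to read the requirement that interpolated samples be linearly classified in the feature space.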
