Person image generation is an intriguing yet challenging problem, and it becomes even more difficult under constrained situations. In this work, we propose a novel pipeline to generate contextually relevant person images and insert them into an existing scene while preserving the global semantics. More specifically, we aim to insert a person such that the location, pose, and scale of the inserted person blend in with those of the persons already in the scene. Our method uses three individual networks in a sequential pipeline. First, we predict the potential location and skeletal structure of the new person by conditioning a Wasserstein Generative Adversarial Network (WGAN) on the existing human skeletons in the scene. Next, the predicted skeleton is refined through a shallow linear network to achieve higher structural accuracy in the generated image. Finally, the target image is generated from the refined skeleton using another generative network conditioned on a given image of the target person. In our experiments, we achieve high-resolution, photo-realistic generation results while preserving the general context of the scene. We conclude the paper with multiple qualitative and quantitative benchmarks on the results.
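The data flow of the three sequential stages can be sketched as follows. This is only a structural illustration under stated assumptions: every function name, the joint count, and the stand-in computations are hypothetical and are not taken from the paper. In particular, the stand-ins do not implement a WGAN, a learned linear network, or an image generator; they only show how the output of each stage feeds the next.

```python
import random
from typing import Dict, List, Tuple

Keypoint = Tuple[float, float]   # (x, y) in image coordinates
Skeleton = List[Keypoint]        # fixed-length list of joints

NUM_JOINTS = 18  # assumption: the abstract does not state the joint count


def predict_skeleton(scene_skeletons: List[Skeleton], rng: random.Random) -> Skeleton:
    """Stage 1 stand-in (hypothetical): the paper conditions a WGAN on the
    existing skeletons; here we just perturb their per-joint mean."""
    n = len(scene_skeletons)
    mean = [
        (sum(s[j][0] for s in scene_skeletons) / n,
         sum(s[j][1] for s in scene_skeletons) / n)
        for j in range(NUM_JOINTS)
    ]
    return [(x + rng.gauss(0.0, 5.0), y + rng.gauss(0.0, 5.0)) for x, y in mean]


def refine_skeleton(skeleton: Skeleton, weight: float = 1.0, bias: float = 0.0) -> Skeleton:
    """Stage 2 stand-in (hypothetical): an affine map in place of the
    shallow linear refinement network."""
    return [(weight * x + bias, weight * y + bias) for x, y in skeleton]


def render_person(skeleton: Skeleton, target_appearance: str) -> Dict[str, object]:
    """Stage 3 stand-in (hypothetical): returns the bounding box a generator
    would fill, together with the appearance it is conditioned on."""
    xs = [x for x, _ in skeleton]
    ys = [y for _, y in skeleton]
    return {
        "appearance": target_appearance,
        "bbox": (min(xs), min(ys), max(xs), max(ys)),
    }


def insert_person(scene_skeletons: List[Skeleton], target_appearance: str,
                  seed: int = 0) -> Dict[str, object]:
    """Chain the three stages: predict, refine, then render."""
    rng = random.Random(seed)
    coarse = predict_skeleton(scene_skeletons, rng)
    refined = refine_skeleton(coarse)
    return render_person(refined, target_appearance)
```

Keeping the stages as separate functions mirrors the sequential design described above: each stage could be swapped for a trained network without changing the others, as long as it consumes and produces the same skeleton representation.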