Do As I Do: Pose Guided Human Motion Copy

Human motion copy is an intriguing yet challenging task in artificial intelligence and computer vision, which strives to generate a fake video of a target person performing the motion of a source person. The problem is inherently difficult because subtle human-body texture details must be generated and temporal consistency must be maintained. Existing approaches typically adopt a conventional GAN with an L1 or L2 loss to produce the target fake video, which intrinsically requires a large number of training samples that are hard to acquire. Meanwhile, current methods still struggle to attain realistic image details and temporal consistency, and these shortcomings are easily perceived by human observers. Motivated by this, we tackle the issues from three aspects: (1) We constrain pose-to-appearance generation with a perceptual loss and a theoretically motivated Gromov-Wasserstein loss to bridge the gap between pose and appearance. (2) We present an episodic memory module in the pose-to-appearance generation to propel continuous learning that helps the model learn from its past poor generations. We also utilize geometrical cues of the face to optimize facial details and refine each key body part with a dedicated local GAN. (3) We advocate generating the foreground in a sequence-to-sequence manner rather than a single-frame manner, explicitly enforcing temporal consistency. Empirical results on five datasets (iPER, ComplexMotion, SoloDance, Fish, and Mouse) demonstrate that our method is capable of generating realistic target videos while precisely copying motion from a source video. Our method significantly outperforms state-of-the-art approaches and gains 7.2% and 12.4% improvements in PSNR and FID, respectively.
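To illustrate why a Gromov-Wasserstein term can relate pose and appearance even though the two live in different feature spaces, the following is a minimal sketch of the standard Gromov-Wasserstein objective with a squared loss, evaluated for a fixed coupling. It is not the loss as implemented in this work, whose exact form the abstract does not give. The function names `pairwise_sq_dists` and `gw_cost`, the toy feature shapes, and the uniform identity coupling are all assumptions made for illustration; a full Gromov-Wasserstein distance would additionally minimize this objective over couplings.

```python
import numpy as np


def pairwise_sq_dists(x):
    """Squared Euclidean distance matrix between the rows of x, shape (n, n)."""
    diff = x[:, None, :] - x[None, :, :]
    return np.sum(diff ** 2, axis=-1)


def gw_cost(c1, c2, coupling):
    """Gromov-Wasserstein objective (squared loss) for a FIXED coupling.

    Computes sum_{i,j,k,l} (c1[i,k] - c2[j,l])**2 * T[i,j] * T[k,l]
    by expanding the square, which avoids building a 4-D tensor.
    c1: (n, n) intra-space distances (e.g. between pose features)
    c2: (m, m) intra-space distances (e.g. between appearance features)
    coupling: (n, m) non-negative matrix T (hypothetical matching).
    """
    p = coupling.sum(axis=1)  # row marginals of T
    q = coupling.sum(axis=0)  # column marginals of T
    term1 = p @ (c1 ** 2) @ p
    term2 = q @ (c2 ** 2) @ q
    cross = np.sum((c1 @ coupling @ c2.T) * coupling)
    return term1 + term2 - 2.0 * cross


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    pose = rng.normal(size=(6, 3))        # toy "pose" features in R^3
    basis, _ = np.linalg.qr(rng.normal(size=(8, 3)))
    appearance = pose @ basis.T           # isometric embedding into R^8
    coupling = np.eye(6) / 6.0            # assumed one-to-one matching
    cost = gw_cost(pairwise_sq_dists(pose),
                   pairwise_sq_dists(appearance),
                   coupling)
    print(f"GW objective for the isometric pair: {cost:.3e}")
```

Because the objective compares only within-space distance matrices, the two feature sets may have different dimensionality; in the toy case above the second set is an isometric embedding of the first, so the objective is zero up to floating-point error.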
