Implicit Identity Representation Conditioned Memory Compensation Network for Talking Head Video Generation

Talking head video generation aims to animate a human face in a still source image with dynamic poses and expressions, using motion information derived from a driving video while preserving the identity of the person in the source image. However, dramatic and complex motions in the driving video make generation ambiguous: the still source image cannot provide sufficient appearance information for occluded regions or subtle expression variations, which produces severe artifacts and significantly degrades generation quality. To tackle this problem, we propose to learn a global facial representation space and design a novel implicit identity representation conditioned memory compensation network, termed MCNet, for high-fidelity talking head generation. Specifically, we devise a network module that learns a unified spatial facial meta-memory bank from all training samples, which provides rich facial structure and appearance priors to compensate the warped source facial features during generation. Furthermore, we propose an effective query mechanism based on implicit identity representations learned from the discrete keypoints of the source image, which facilitates the retrieval of more correlated information from the memory bank for the compensation. Extensive experiments demonstrate that MCNet learns representative and complementary facial memory and clearly outperforms previous state-of-the-art talking head generation methods on the VoxCeleb1 and CelebV datasets. Code is available on our \href{https://github.com/harlanhong/ICCV2023-MCNET}{project page}.
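The memory query and compensation idea in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the single linear layer in `identity_code`, the channel-wise scaling used to condition the memory, the scaled dot-product attention, the residual addition, and all tensor sizes are assumptions chosen only to make the data flow concrete. In a trained model the memory bank and projection weights would be learned; here they are random.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def identity_code(keypoints, weight, bias):
    """Map (K, 2) source keypoints to a C-dim identity vector.

    Hypothetical stand-in for the implicit identity representation:
    one linear layer followed by tanh.
    """
    return np.tanh(keypoints.reshape(-1) @ weight + bias)


def compensate(warped, memory, id_code):
    """Compensate warped source features with an identity-conditioned memory.

    warped, memory: arrays of shape (C, H, W); id_code: shape (C,).
    Returns the compensated features (C, H, W) and the attention map.
    """
    c, h, w = warped.shape
    # Assumed conditioning: scale each memory channel by (1 + id_code).
    conditioned = memory * (1.0 + id_code)[:, None, None]
    queries = warped.reshape(c, -1).T           # (H*W, C)
    keys_values = conditioned.reshape(c, -1).T  # (H*W of memory, C)
    # Each warped location attends over all memory locations.
    attn = softmax(queries @ keys_values.T / np.sqrt(c), axis=-1)
    retrieved = attn @ keys_values              # (H*W, C)
    # Assumed residual compensation of the warped features.
    out = (queries + retrieved).T.reshape(c, h, w)
    return out, attn


# Toy example with random tensors (all sizes are illustrative).
rng = np.random.default_rng(0)
C, H, W, K = 8, 4, 4, 15
memory = rng.standard_normal((C, H, W))    # stands in for the global memory bank
warped = rng.standard_normal((C, H, W))    # stands in for warped source features
keypoints = rng.uniform(-1.0, 1.0, (K, 2))
weight = 0.1 * rng.standard_normal((2 * K, C))
bias = np.zeros(C)

code = identity_code(keypoints, weight, bias)
compensated, attention = compensate(warped, memory, code)
print(compensated.shape, attention.shape)
```

Under these assumptions, the identity vector changes which memory entries each warped location retrieves, which is the role the abstract assigns to the identity-conditioned query.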
