Despite recent advances in synchronizing lip movements with arbitrary audio, current methods still struggle to balance generation quality against the model's generalization ability. Previous studies either require long-term data for training or produce similar, low-quality movement patterns across all subjects. In this paper, we propose StyleSync, an effective framework that enables high-fidelity lip synchronization. We identify that a style-based generator is sufficient to provide this property in both one-shot and few-shot scenarios. Specifically, we design a mask-guided spatial information encoding module that preserves the details of the given face, while the mouth shapes are accurately modified by audio through modulated convolutions. Moreover, our design enables personalized lip-sync by introducing style space and generator refinement on only a limited number of frames, so that the identity and talking style of a target person can be accurately preserved. Extensive experiments demonstrate the effectiveness of our method in producing high-fidelity results across a variety of scenes. Resources can be found at https://hangz-nju-cuhk.github.io/projects/StyleSync.
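To make the phrase "modulated convolutions" concrete, the following is a minimal sketch of one such layer in plain Python. It is an illustration under stated assumptions, not the StyleSync implementation: the function name `modulated_conv2d`, the per-channel `style` vector (which in a lip-sync setting would presumably be derived from audio features), and the StyleGAN2-style demodulation step are all assumptions that the abstract does not specify. A practical version would use a tensor library rather than nested lists.

```python
import math


def modulated_conv2d(x, weight, style, eps=1e-8):
    """Hypothetical modulated convolution (assumed StyleGAN2-style).

    x:      input feature map, nested lists shaped [C_in][H][W]
    weight: kernels, nested lists shaped [C_out][C_in][K][K]
    style:  one scale per input channel, length C_in
            (assumed here to come from an audio encoder)
    Returns a feature map shaped [C_out][H-K+1][W-K+1] (no padding).
    """
    c_out, c_in, k = len(weight), len(weight[0]), len(weight[0][0])
    h, w = len(x[0]), len(x[0][0])
    out = []
    for o in range(c_out):
        # Modulate: scale each input channel's kernel by its style value.
        w_mod = [[[weight[o][i][a][b] * style[i] for b in range(k)]
                  for a in range(k)] for i in range(c_in)]
        # Demodulate: rescale so the modulated kernel has unit L2 norm.
        norm = math.sqrt(sum(v * v for ch in w_mod for row in ch for v in row) + eps)
        w_mod = [[[v / norm for v in row] for row in ch] for ch in w_mod]
        # Slide the kernel over the input ("valid" cross-correlation).
        plane = [[sum(w_mod[i][a][b] * x[i][r + a][c + b]
                      for i in range(c_in) for a in range(k) for b in range(k))
                  for c in range(w - k + 1)]
                 for r in range(h - k + 1)]
        out.append(plane)
    return out


# Toy usage: two input channels of ones, a single 1x1 kernel, style (3, 4).
features = [[[1.0] * 3 for _ in range(3)] for _ in range(2)]
kernels = [[[[1.0]], [[1.0]]]]
result = modulated_conv2d(features, kernels, style=[3.0, 4.0])
```

In this toy case the modulated kernel is (3, 4), demodulation divides it by its norm of about 5, and every output value is therefore close to 0.6 + 0.8 = 1.4. Because of the demodulation step, doubling the style vector leaves the output essentially unchanged; only the relative weighting of the input channels matters.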