Talking face generation is the challenging task of synthesizing a natural, realistic face that is accurately synchronized with the given audio. Because of co-articulation, in which a phone is influenced by the phones that precede or follow it, the articulation of a phone varies with its phonetic context. Modeling lip motion together with the phonetic context can therefore produce lip movement that is better aligned spatio-temporally. We investigate the role of phonetic context in generating lip motion for talking face generation and propose the Context-Aware Lip-Sync framework (CALS), which explicitly leverages phonetic context to generate the lip movement of the target face. CALS comprises an Audio-to-Lip module and a Lip-to-Face module. The former is pretrained with masked learning to map each phone to a contextualized lip motion unit. The contextualized lip motion unit then guides the latter in synthesizing a target identity with context-aware lip motion. Through extensive experiments, we verify that simply exploiting the phonetic context in the proposed CALS framework effectively enhances spatio-temporal alignment. We also demonstrate the extent to which the phonetic context assists lip synchronization and find the effective window size for lip generation to be approximately 1.2 seconds.
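To give a sense of scale for the reported window of approximately 1.2 seconds, the sketch below converts that duration into a range of video frames around a target frame. It is an illustration under stated assumptions, not the CALS implementation: the abstract does not give a frame rate or say whether the window is centered, so the 25 fps default, the centered layout, and the function name `context_window` are all hypothetical.

```python
def context_window(num_frames, center, window_sec=1.2, fps=25):
    """Return (start, end) frame indices, end exclusive, around `center`.

    Assumptions (not from the paper): 25 fps video and a window that is
    centered on the target frame and clipped at the clip boundaries.
    """
    if not 0 <= center < num_frames:
        raise IndexError("center frame out of range")
    # Frames on each side of the target frame: 1.2 s * 25 fps = 30 -> 15.
    half = round(window_sec * fps) // 2
    start = max(0, center - half)
    end = min(num_frames, center + half + 1)
    return start, end


if __name__ == "__main__":
    print(context_window(100, 50))  # (35, 66)
    print(context_window(100, 0))   # (0, 16)
```

Under these assumptions, a frame in the middle of a clip sees 15 frames of context on each side (31 frames in total), while frames near the start or end of a clip see a truncated window.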