In this work, we tackle the challenge of enhancing realism and expressiveness in talking head video generation by focusing on the dynamic and nuanced relationship between audio cues and facial movements. We identify the limitations of traditional techniques, which often fail to capture the full spectrum of human expressions and the uniqueness of individual facial styles. To address these issues, we propose EMO, a novel framework that utilizes a direct audio-to-video synthesis approach, bypassing the need for intermediate 3D models or facial landmarks. Our method ensures seamless frame transitions and consistent identity preservation throughout the video, resulting in highly expressive and lifelike animations. Experimental results demonstrate that EMO can produce not only convincing speaking videos but also singing videos in various styles, significantly outperforming existing state-of-the-art methods in expressiveness and realism.