Creating realistic, natural, and lip-readable talking-face videos remains a formidable challenge. Prior research has concentrated primarily on generating and aligning individual frames while overlooking the smoothness of frame-to-frame transitions and temporal dependencies. This often degrades visual quality in practical settings, particularly when handling complex facial data and audio content, and frequently produces visual artifacts that are incongruent with the semantic content. In particular, synthesized videos commonly exhibit disorganized lip movements that are difficult to understand and recognize. To overcome these limitations, this paper introduces optical flow to guide facial image generation, enhancing inter-frame continuity and semantic consistency. We propose "OpFlowTalker", a novel approach that predicts optical-flow changes from audio inputs rather than predicting images directly. This smooths image transitions and aligns visual changes with the semantic content. Moreover, it employs a sequence-fusion technique in place of independent single-frame generation, preserving contextual information and maintaining temporal coherence. We also develop an optical-flow synchronization module that regulates both full-face and lip movements, optimizing visual synthesis by balancing regional dynamics. Furthermore, we introduce a Visual Text Consistency Score (VTCS) that measures the lip-readability of synthesized videos. Extensive empirical evidence validates the effectiveness of our approach.
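To make the flow-guided generation idea concrete, the sketch below shows how a dense optical-flow field predicted for the next time step could warp the current frame into the next one via bilinear sampling. This is a minimal, hypothetical illustration of the general technique the abstract refers to, not the paper's actual architecture; the function name and the grayscale/NumPy setup are assumptions for clarity.

```python
import numpy as np

def warp_frame(frame, flow):
    """Warp a frame by a dense optical-flow field via bilinear sampling.

    Illustrative sketch only (not the paper's implementation).
    frame: (H, W) grayscale image.
    flow:  (H, W, 2) per-pixel displacements (dx, dy); each target pixel
           (x, y) samples the source frame at (x + dx, y + dy).
    """
    H, W = frame.shape
    ys, xs = np.mgrid[0:H, 0:W].astype(np.float64)
    # Source coordinates, clamped to the image border.
    src_x = np.clip(xs + flow[..., 0], 0, W - 1)
    src_y = np.clip(ys + flow[..., 1], 0, H - 1)
    # Integer corners and fractional weights for bilinear interpolation.
    x0 = np.floor(src_x).astype(int); x1 = np.minimum(x0 + 1, W - 1)
    y0 = np.floor(src_y).astype(int); y1 = np.minimum(y0 + 1, H - 1)
    wx = src_x - x0
    wy = src_y - y0
    top = frame[y0, x0] * (1 - wx) + frame[y0, x1] * wx
    bot = frame[y1, x0] * (1 - wx) + frame[y1, x1] * wx
    return top * (1 - wy) + bot * wy
```

In a full pipeline of this kind, the flow field would be regressed from audio features and the warped frame refined by a generator, so that inter-frame changes, rather than whole frames, are what the network must explain.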