Video editing-based talking face generation aims to preserve video details such as pose, lighting, and gestures while modifying only the lip motion, typically using an identity reference image to maintain speaker consistency. However, this mechanism can introduce lip leakage, where the generated lips are influenced by the reference image rather than driven solely by the driving audio. Such leakage is difficult to detect with standard metrics and conventional test setups. To address this, we propose a systematic evaluation methodology for analyzing and quantifying lip leakage. Our framework employs three complementary test setups: silent-input generation, mismatched audio-video pairing, and matched audio-video synthesis. We also introduce derived metrics, including a lip-sync discrepancy score and a silent-audio-based lip-sync score. In addition, we study how different identity reference selections affect leakage, providing insights into reference design. The proposed methodology is model-agnostic and establishes a more reliable benchmark for future research in talking face generation.
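To make the protocol concrete, the sketch below shows one plausible instantiation of the three test setups and the two derived metrics. The abstract does not fix exact formulas, so the definitions here are assumptions: `generate(frames, reference, audio)` stands in for any editing-based talking face model, and `sync_confidence(video, audio)` for a SyncNet-style audio-visual synchronization scorer; both are hypothetical callables supplied by the evaluator.

```python
import numpy as np

def evaluate_leakage(generate, sync_confidence, frames, reference,
                     matched_audio, mismatched_audio):
    """Run the three test setups and return sync scores plus the
    two derived leakage metrics (definitions are assumptions)."""
    # Setup 1: silent-input generation. With a silent waveform, any
    # residual lip motion can only come from the identity reference,
    # so it is direct evidence of leakage.
    silent_audio = np.zeros_like(matched_audio)
    video_silent = generate(frames, reference, silent_audio)

    # Setup 2: mismatched audio-video pairing (audio from another clip).
    video_mismatched = generate(frames, reference, mismatched_audio)

    # Setup 3: matched audio-video synthesis (the standard protocol).
    video_matched = generate(frames, reference, matched_audio)

    # Baseline lip-sync quality under the standard matched setup.
    matched_sync = sync_confidence(video_matched, matched_audio)

    # Derived metric 1 (assumed definition): lip-sync discrepancy under
    # the mismatched setup -- sync confidence against the driving
    # (mismatched) audio minus sync confidence against the clip's
    # original audio. A model that truly follows the driving audio
    # yields a large positive gap; leakage shrinks or inverts it.
    lip_sync_discrepancy = (
        sync_confidence(video_mismatched, mismatched_audio)
        - sync_confidence(video_mismatched, matched_audio))

    # Derived metric 2 (assumed definition): silent-audio-based lip-sync
    # score, i.e. how well the silent-input output still syncs with the
    # clip's original audio. Higher values mean the lips were copied
    # from the reference/source rather than driven by the silent input.
    silent_audio_sync = sync_confidence(video_silent, matched_audio)

    return {"matched_sync": matched_sync,
            "lip_sync_discrepancy": lip_sync_discrepancy,
            "silent_audio_sync": silent_audio_sync}
```

Passing the model and the scorer in as callables keeps the harness independent of any particular architecture, consistent with the model-agnostic claim above.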