Temporally-Aligned Evaluation for Audio-Driven Talking Head Generation

Audio-driven talking-head generation has advanced rapidly, yet existing evaluation protocols mainly rely on frame-wise metrics that assume strict temporal correspondence between generated and reference videos. This assumption does not match speech-driven facial motion, which naturally includes slight timing shifts, different speaking speeds, and stylistic variations. As a result, conventional metrics may treat harmless timing differences as quality errors, making it harder to fairly compare methods and understand their trade-offs. In this work, we argue that evaluation of dynamic generative models should be formulated as a sequence-alignment problem rather than independent frame comparison. We introduce a unified sequence-level reformulation that integrates Soft Dynamic Time Warping into established evaluation pipelines. By aligning feature trajectories while preserving temporal order, the proposed framework provides robustness to bounded temporal misalignments without altering the underlying perceptual, identity, or synchronization encoders. We show that frame-wise evaluation can be viewed as a special case under rigid alignment, while sequence-level alignment provides improved stability, lower sensitivity to timing differences, and clearer separation between modeling paradigms. Building on this principled formulation, we conduct a large-scale benchmark of 20 methods across seven datasets spanning canonical, in-the-wild, and style-diverse scenarios under standardized protocols. Extensive experiments show that temporally aligned metrics are more robust to timing differences, provide more consistent results across datasets, and better reveal systematic trade-offs between modeling paradigms, such as synchronization versus realism and expressiveness versus stability.
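The relationship between frame-wise and sequence-level scoring described above can be illustrated with a minimal sketch. It assumes squared Euclidean distance between per-frame feature vectors and the standard Soft-DTW recursion with a smoothing parameter `gamma`; the function names (`pairwise_sq_dist`, `soft_dtw`, `framewise_cost`) and the toy trajectories are hypothetical, chosen for illustration only, and are not taken from the authors' implementation or from any particular encoder.

```python
import numpy as np


def pairwise_sq_dist(x, y):
    """Squared Euclidean cost between every frame of x (n, d) and y (m, d)."""
    diff = x[:, None, :] - y[None, :, :]
    return np.sum(diff ** 2, axis=-1)


def softmin(values, gamma):
    """Smoothed minimum: -gamma * log(sum(exp(-v / gamma))), computed stably."""
    z = -np.asarray(values, dtype=float) / gamma
    z_max = np.max(z)
    if not np.isfinite(z_max):  # every candidate is +inf
        return np.inf
    return -gamma * (z_max + np.log(np.sum(np.exp(z - z_max))))


def soft_dtw(x, y, gamma=1.0):
    """Soft-DTW discrepancy between two feature trajectories (assumed cost: squared L2)."""
    cost = pairwise_sq_dist(x, y)
    n, m = cost.shape
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Monotone moves only, so temporal order is preserved.
            prev = [acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1]]
            acc[i, j] = cost[i - 1, j - 1] + softmin(prev, gamma)
    return float(acc[n, m])


def framewise_cost(x, y):
    """Rigid alignment: frame t of x is compared only with frame t of y."""
    t = min(len(x), len(y))
    return float(np.sum((x[:t] - y[:t]) ** 2))


# Toy trajectories: the "generated" sequence is the reference delayed slightly.
time = np.linspace(0.0, 2.0 * np.pi, 50)
reference = np.stack([np.sin(time), np.cos(time)], axis=1)
generated = np.stack([np.sin(time - 0.3), np.cos(time - 0.3)], axis=1)

rigid = framewise_cost(reference, generated)
aligned = soft_dtw(reference, generated, gamma=0.1)
print(f"frame-wise cost: {rigid:.3f}  soft-DTW cost: {aligned:.3f}")
```

The frame-wise cost is the cost of one particular warping path, the diagonal. Because Soft-DTW takes a smoothed minimum over all monotone paths, its value cannot exceed the diagonal cost for equal-length sequences, so a small delay is penalized less under alignment than under rigid comparison. Note that Soft-DTW values can be negative and depend on `gamma`, so this sketch is only meaningful for comparing scores computed with the same settings.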
