Speech-driven 3D facial animation is important for many multimedia applications. Recent work has shown promise in using either Diffusion models or Transformer architectures for this task. However, simply combining the two does not yield improved performance. We suspect this is due to a shortage of paired audio-4D data, which is crucial for the Transformer to perform effectively as a denoiser within the Diffusion framework. To tackle this issue, we present DiffSpeaker, a Transformer-based network equipped with novel biased conditional attention modules. These modules replace the traditional self/cross-attention in standard Transformers, incorporating carefully designed biases that steer the attention mechanisms to concentrate on both the relevant task-specific and diffusion-related conditions. We also explore the trade-off between accurate lip synchronization and non-verbal facial expressions within the Diffusion paradigm. Experiments show that our model not only achieves state-of-the-art performance on existing benchmarks but also attains fast inference, owing to its ability to generate facial motions in parallel.
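To make the mechanism concrete, the following is a minimal, hypothetical PyTorch sketch of attention with an additive conditioning bias, in the spirit of the biased conditional attention described above. The module name `BiasedConditionalAttention`, the choice of prepending timestep/style embeddings as condition tokens, the band-shaped alignment bias, and all shapes are illustrative assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class BiasedConditionalAttention(nn.Module):
    """Sketch of cross-attention with an additive bias (assumed design).

    Diffusion-timestep and speaker-style embeddings are prepended to the
    key/value sequence as condition tokens; a band-shaped bias over the
    audio keys nudges each motion frame toward temporally aligned audio
    features. Illustrative only, not the authors' exact module.
    """

    def __init__(self, dim: int, n_heads: int = 4):
        super().__init__()
        assert dim % n_heads == 0
        self.n_heads = n_heads
        self.head_dim = dim // n_heads
        self.q_proj = nn.Linear(dim, dim)
        self.kv_proj = nn.Linear(dim, 2 * dim)
        self.out_proj = nn.Linear(dim, dim)

    def forward(self, motion, audio, cond_tokens):
        # motion:      (B, T, D) noisy motion frames (queries)
        # audio:       (B, T, D) audio features, assumed frame-aligned
        # cond_tokens: (B, C, D) timestep + style embeddings
        B, T, D = motion.shape
        C = cond_tokens.shape[1]
        kv_input = torch.cat([cond_tokens, audio], dim=1)  # (B, C+T, D)
        q = self.q_proj(motion)
        k, v = self.kv_proj(kv_input).chunk(2, dim=-1)

        def split(x):  # (B, L, D) -> (B, H, L, head_dim)
            return x.view(B, -1, self.n_heads, self.head_dim).transpose(1, 2)

        q, k, v = split(q), split(k), split(v)
        logits = q @ k.transpose(-2, -1) / self.head_dim ** 0.5  # (B,H,T,C+T)

        # Additive bias: zero on condition tokens; on audio keys, penalize
        # query-key pairs far from the diagonal (temporal alignment prior).
        idx = torch.arange(T, device=motion.device)
        dist = (idx[:, None] - idx[None, :]).abs().float()  # (T, T)
        bias = torch.zeros(T, C + T, device=motion.device)
        bias[:, C:] = -0.1 * dist  # linear decay; slope is illustrative
        logits = logits + bias

        attn = logits.softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, T, D)
        return self.out_proj(out)
```

One plausible reading of the abstract is captured here: the bias injects an alignment prior that compensates for scarce paired audio-4D data, while the condition tokens expose the diffusion step and speaker identity to every attention layer without a separate conditioning pathway.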