We propose DiT-Head, a novel talking head synthesis pipeline that is based on diffusion transformers and uses audio as a condition to drive the denoising process of a diffusion model. Our method is scalable and can generalise to multiple identities while producing high-quality results. We train and evaluate the proposed approach and compare it against existing talking head synthesis methods, showing that our model is competitive with them in terms of visual quality and lip-sync accuracy. These results highlight the potential of the approach for a wide range of applications, including virtual assistants, entertainment, and education. For a video demonstration of the results and our user study, please refer to the supplementary material.