Speech-driven animation has gained significant traction in recent years, with current methods achieving near-photorealistic results. However, the field remains underexplored with regard to non-verbal communication, despite evidence demonstrating its importance in human interaction. In particular, generating laughter sequences presents a unique challenge due to the intricacy and nuances of this behaviour. This paper aims to bridge this gap by proposing a novel model capable of generating realistic laughter sequences, given a still portrait and an audio clip containing laughter. We highlight the failure cases of traditional facial animation methods and leverage recent advances in diffusion models to produce convincing laughter videos. We train our model on a diverse set of laughter datasets and introduce an evaluation metric specifically designed for laughter. When compared with previous speech-driven approaches, our model achieves state-of-the-art performance across all metrics, even when those approaches are re-trained for laughter generation. Our code and project are publicly available.