Speech-driven animation has gained significant traction in recent years, with current methods achieving near-photorealistic results. However, non-verbal communication remains underexplored in this field, despite evidence of its importance in human interaction. In particular, generating laughter sequences presents a unique challenge due to the intricacy and nuances of this behaviour. This paper aims to bridge this gap by proposing a novel model capable of generating realistic laughter sequences, given a still portrait and an audio clip containing laughter. We highlight the failure cases of traditional facial animation methods and leverage recent advances in diffusion models to produce convincing laughter videos. We train our model on a diverse set of laughter datasets and introduce an evaluation metric specifically designed for laughter. When compared with previous speech-driven approaches, our model achieves state-of-the-art performance across all metrics, even when those approaches are re-trained for laughter generation.