Taking inspiration from recent developments in visual generative tasks using diffusion models, we propose a method for end-to-end speech-driven video editing using a denoising diffusion model. Given a video of a talking person and a separate speech recording, the lip and jaw motions are re-synchronized to the new audio without relying on intermediate structural representations such as facial landmarks or a 3D face model. We show that this is possible by conditioning a denoising diffusion model on audio mel-spectral features to generate synchronized facial motion. Proof-of-concept results are demonstrated on both single-speaker and multi-speaker video editing, providing a baseline model on the CREMA-D audiovisual dataset. To the best of our knowledge, this is the first work to demonstrate and validate the feasibility of applying end-to-end denoising diffusion models to the task of audio-driven video editing.
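To make the idea of conditioning a denoising diffusion model on audio features concrete, the following is a minimal sketch of a standard noise-prediction training step with an extra conditioning input. It is an illustration under stated assumptions, not the architecture or loss used in this work: the linear noise schedule, the function names, the flattened toy "frame", and the placeholder `zero_denoiser` are all hypothetical, and a real model would be a neural network operating on image tensors and mel-spectrogram windows.

```python
import math
import random


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced noise variances, one per diffusion step (assumed schedule)."""
    if num_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alphas(betas):
    """alpha_bar_t = product over s <= t of (1 - beta_s)."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def add_noise(x0, t, alpha_bars, rng):
    """Sample x_t from q(x_t | x_0); return x_t and the Gaussian noise used."""
    a = alpha_bars[t]
    noise = [rng.gauss(0.0, 1.0) for _ in x0]
    xt = [math.sqrt(a) * x + math.sqrt(1.0 - a) * e for x, e in zip(x0, noise)]
    return xt, noise


def training_loss(denoiser, x0, mel, t, alpha_bars, rng):
    """Mean squared error between the true noise and the denoiser's prediction.

    The denoiser receives the audio features `mel` alongside the noisy frame,
    which is what "conditioning on audio" means in this sketch.
    """
    xt, noise = add_noise(x0, t, alpha_bars, rng)
    predicted = denoiser(xt, t, mel)
    return sum((p - e) ** 2 for p, e in zip(predicted, noise)) / len(noise)


def zero_denoiser(xt, t, mel):
    """Hypothetical stand-in for a trained network: always predicts zero noise."""
    return [0.0] * len(xt)


rng = random.Random(0)
alpha_bars = cumulative_alphas(linear_beta_schedule(1000))
frame = [0.5, -0.25, 0.1, 0.9]  # toy flattened video frame (assumption)
mel = [0.2, 0.4, 0.1]           # toy mel feature vector (assumption)
loss = training_loss(zero_denoiser, frame, mel, t=500, alpha_bars=alpha_bars, rng=rng)
print(f"toy loss at t=500: {loss:.4f}")
```

In this sketch the audio features enter only as an extra argument to the denoiser; how they are injected into the network (for example by concatenation or through conditioning layers) is a design choice the abstract does not specify.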