With the introduction of diffusion-based video generation techniques, audio-conditioned human video generation has recently achieved significant breakthroughs in both the naturalness of motion and the synthesis of portrait details. Because audio signals offer only limited control over human motion, existing methods often add auxiliary spatial signals to stabilize movement, which can compromise the naturalness and freedom of motion. In this paper, we propose Loopy, an end-to-end audio-only conditioned video diffusion model. Specifically, we design an inter- and intra-clip temporal module and an audio-to-latents module, enabling the model to leverage long-term motion information from the data to learn natural motion patterns and to improve the correlation between audio and portrait movement. This design removes the need for the manually specified spatial motion templates that existing methods use to constrain motion during inference. Extensive experiments show that Loopy outperforms recent audio-driven portrait diffusion models, delivering more lifelike and higher-quality results across various scenarios.
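As a rough illustration of what an audio-to-latents module could look like, the sketch below projects per-frame audio features into latent tokens that a diffusion backbone might attend to via cross-attention. The module name, dimensions, token count, and injection mechanism are all assumptions made for illustration; the abstract does not describe Loopy's actual implementation.

```python
# Hypothetical sketch of an audio-to-latents projection: per-frame audio
# features are mapped to a small set of latent tokens for conditioning a
# video diffusion backbone. Names and shapes are illustrative assumptions.
import torch
import torch.nn as nn


class AudioToLatents(nn.Module):
    def __init__(self, audio_dim: int = 768, latent_dim: int = 1024, num_tokens: int = 4):
        super().__init__()
        self.num_tokens = num_tokens
        # Project each audio frame feature into num_tokens latent tokens.
        self.proj = nn.Sequential(
            nn.Linear(audio_dim, latent_dim),
            nn.SiLU(),
            nn.Linear(latent_dim, latent_dim * num_tokens),
        )
        self.norm = nn.LayerNorm(latent_dim)

    def forward(self, audio_feats: torch.Tensor) -> torch.Tensor:
        # audio_feats: (batch, frames, audio_dim), e.g. from a pretrained audio encoder
        b, t, _ = audio_feats.shape
        tokens = self.proj(audio_feats).view(b, t, self.num_tokens, -1)
        # Output: (batch, frames, num_tokens, latent_dim), ready for cross-attention.
        return self.norm(tokens)


if __name__ == "__main__":
    a2l = AudioToLatents()
    feats = torch.randn(2, 16, 768)   # wav2vec-style frame features (assumed)
    latents = a2l(feats)
    print(latents.shape)              # torch.Size([2, 16, 4, 1024])
```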