Generating dance from music is an important task, yet current methods mainly produce joint sequences. Such outputs are hard to interpret visually, and the need for precise joint annotations complicates data collection. We introduce DabFusion, a Dance Any Beat Diffusion model that takes music as a conditional input and generates dance videos directly from still images, building on the principles of conditional image-to-video generation. This approach pioneers the use of music as a conditioning factor in image-to-video synthesis. Our method has two stages: first, we train an auto-encoder to predict latent optical flow between reference and driving frames, which eliminates the need for joint annotation; second, we train a U-Net-based diffusion model to generate these latent optical flows, guided by music rhythm encoded by CLAP. Although this baseline model produces high-quality dance videos, it struggles with rhythm alignment. We therefore add beat information to the model, which improves synchronization. For quantitative assessment, we introduce a 2D motion-music alignment score (2D-MM Align). Evaluated on the AIST++ dataset, our enhanced model shows marked improvements in the 2D-MM Align score and in established metrics. Video results can be found on our project page: https://DabFusion.github.io.
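The abstract does not define 2D-MM Align, so the following is only a minimal sketch of how a beat-alignment-style score between music and 2D motion could be computed, not the paper's formulation. It assumes that motion beats can be approximated by local minima of a per-frame motion-magnitude curve (for example, mean optical-flow magnitude) and that each music beat is scored by a Gaussian of its distance to the nearest motion beat. The function names `motion_beats` and `beat_align_score` and the tolerance `sigma` are hypothetical.

```python
import numpy as np


def motion_beats(flow_magnitude, fps):
    """Return times (seconds) of local minima in a per-frame motion curve.

    Assumption: a motion beat is a frame whose motion magnitude is
    strictly lower than that of both neighbouring frames.
    """
    m = np.asarray(flow_magnitude, dtype=float)
    if m.size < 3:
        return np.array([])
    idx = np.where((m[1:-1] < m[:-2]) & (m[1:-1] < m[2:]))[0] + 1
    return idx / fps


def beat_align_score(music_beat_times, motion_beat_times, sigma=0.1):
    """Mean over music beats of exp(-d^2 / (2 * sigma^2)).

    d is the distance (seconds) from a music beat to the nearest motion
    beat; sigma is a hypothetical tolerance, not a value from the paper.
    """
    music = np.asarray(music_beat_times, dtype=float)
    motion = np.asarray(motion_beat_times, dtype=float)
    if music.size == 0 or motion.size == 0:
        return 0.0
    # Pairwise distances via broadcasting, then nearest motion beat per music beat.
    d = np.abs(music[:, None] - motion[None, :]).min(axis=1)
    return float(np.mean(np.exp(-d ** 2 / (2 * sigma ** 2))))


# Toy motion curve sampled at 2 frames per second.
curve = [3, 2, 1, 2, 3, 2, 1, 2, 3]
beats = motion_beats(curve, fps=2)
aligned = beat_align_score([1.0, 3.0], beats)
misaligned = beat_align_score([2.0], beats)
```

In this toy example the music beats that coincide with motion minima receive the maximum score, while a beat one second away from any motion minimum scores close to zero. In practice the music beat times would come from a beat tracker and the motion curve from the generated video.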