Generating dance from music is crucial for advancing automated choreography. Current methods typically produce skeleton keypoint sequences instead of dance videos and lack the capability to make specific individuals dance, which reduces their real-world applicability. These methods also require precise keypoint annotations, complicating data collection and limiting the use of self-collected video datasets. To overcome these challenges, we introduce a novel task: generating dance videos directly from images of individuals, guided by music. This task enables dance generation for specific individuals without requiring keypoint annotations, making it more versatile and applicable to various situations. Our solution, the Dance Any Beat Diffusion model (DabFusion), utilizes a reference image and a music piece to generate dance videos featuring various dance types and choreographies. The music is analyzed by our specially designed music encoder, which identifies essential features including dance style, movement, and rhythm. DabFusion excels in generating dance videos not only for individuals in the training dataset but also for any previously unseen person. This versatility stems from its approach of generating latent optical flow, which contains all the motion information needed to animate any person in the image. We evaluate DabFusion's performance on the AIST++ dataset, focusing on video quality, audio-video synchronization, and motion-music alignment. We propose a 2D Motion-Music Alignment Score (2D-MM Align), which builds on the Beat Alignment Score to more effectively evaluate motion-music alignment for this new task. Experiments show that DabFusion establishes a solid baseline for this innovative task. Video results can be found on our project page: https://DabFusion.github.io.
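The Beat Alignment Score that 2D-MM Align builds on can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes music beats and motion beats are already extracted as timestamps (in seconds), matches each music beat to its nearest motion beat, and averages a Gaussian kernel over the distances. The kernel width `sigma` and the matching direction are assumptions; the paper's 2D variant would derive motion beats from 2D cues such as latent optical flow rather than 3D joint positions.

```python
import numpy as np

def beat_alignment_score(music_beats, motion_beats, sigma=0.1):
    """Illustrative beat-alignment metric (hypothetical parameters).

    For each music beat, find the temporal distance to the nearest
    motion beat and score it with a Gaussian kernel; perfect
    alignment yields 1.0, large offsets approach 0.0.
    """
    scores = []
    for tb in music_beats:
        # Distance from this music beat to the closest motion beat.
        nearest = min(abs(tb - tm) for tm in motion_beats)
        scores.append(np.exp(-nearest ** 2 / (2 * sigma ** 2)))
    return float(np.mean(scores))
```

For example, identical beat lists score 1.0, while shifting the motion beats by a fraction of a second lowers the score smoothly, which is what makes the metric usable for comparing generated videos.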