Dance-to-music generation aims to produce music that is aligned with dance movements. Existing approaches typically rely on body-motion features extracted from a single human dancer and on limited dance-to-music datasets, which restricts their performance and applicability in real-world scenarios involving multiple dancers or non-human dancers. In this paper, we propose PF-D2M, a universal diffusion-based dance-to-music generation model that incorporates visual features extracted from dance videos. PF-D2M employs a progressive training strategy that effectively addresses data scarcity and generalization challenges. Both objective and subjective evaluations show that PF-D2M achieves state-of-the-art performance in dance-music alignment and music quality.
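To make the conditioning setup concrete, below is a minimal, hypothetical PyTorch sketch of the general pattern the abstract describes: a diffusion denoiser predicts the noise added to a music latent, conditioned on visual features extracted from a dance video. The toy MLP denoiser, the dimensions, and the linear noise schedule are all illustrative assumptions, not PF-D2M's actual architecture, and the progressive (staged) training schedule is not shown.

```python
# Hypothetical sketch of video-conditioned diffusion training; the network,
# dimensions, and schedule are assumptions, not the paper's actual design.
import torch
import torch.nn as nn

LATENT_DIM, VISUAL_DIM, STEPS = 128, 512, 1000

class Denoiser(nn.Module):
    """Toy noise-prediction network; real models use U-Nets or transformers."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(LATENT_DIM + VISUAL_DIM + 1, 256),
            nn.SiLU(),
            nn.Linear(256, LATENT_DIM),
        )

    def forward(self, noisy_latent, visual_feat, t):
        # Condition on dance-video features and the diffusion timestep.
        t_embed = t.float().unsqueeze(-1) / STEPS
        return self.net(torch.cat([noisy_latent, visual_feat, t_embed], dim=-1))

# Standard DDPM-style forward-process schedule (linear betas for brevity).
betas = torch.linspace(1e-4, 0.02, STEPS)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

def training_step(model, music_latent, visual_feat):
    """One denoising-objective step: predict the noise added at a random t."""
    t = torch.randint(0, STEPS, (music_latent.size(0),))
    noise = torch.randn_like(music_latent)
    a = alphas_cumprod[t].unsqueeze(-1)
    noisy = a.sqrt() * music_latent + (1 - a).sqrt() * noise
    return nn.functional.mse_loss(model(noisy, visual_feat, t), noise)

model = Denoiser()
loss = training_step(model, torch.randn(4, LATENT_DIM), torch.randn(4, VISUAL_DIM))
```

Conditioning on features taken from the raw video, rather than on skeletal motion of a single dancer, is what lets this pattern extend to multi-dancer and non-human-dancer inputs, since no pose extraction step is required.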