Dance-driven music generation aims to generate musical pieces conditioned on dance videos. Previous works focus on monophonic or raw-audio generation, while the multi-instrument scenario remains under-explored. The challenges of dance-driven multi-instrument music (MIDI) generation are twofold: 1) there is no publicly available paired dataset of multi-instrument MIDI and dance video, and 2) the correlation between music and video is weak. To tackle these challenges, we build the first multi-instrument MIDI and dance paired dataset (D2MIDI). Based on this dataset, we introduce a multi-instrument MIDI generation framework (Dance2MIDI) conditioned on dance video. Specifically, 1) to capture the relationship between dance and music, we employ a Graph Convolutional Network to encode the dance motion, extracting features related to dance movement and dance style; 2) to generate a harmonious rhythm, we use a Transformer model with a cross-attention mechanism to decode the drum-track sequence; and 3) we model the generation of the remaining tracks, conditioned on the drum track, as a sequence understanding and completion task, employing a BERT-like model to comprehend the context of the entire music piece through self-supervised learning. We evaluate the music generated by our framework trained on the D2MIDI dataset and demonstrate that our method achieves state-of-the-art performance.
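To make the described pipeline more concrete, below is a minimal PyTorch-style sketch of a graph-convolutional motion encoder feeding a cross-attention Transformer decoder for the drum track. All class names, dimensions, and the placeholder adjacency matrix are illustrative assumptions; the abstract does not specify the actual Dance2MIDI implementation, and the BERT-like track-completion stage is omitted.

```python
# Hypothetical sketch of the dance-to-drum-track pipeline (not the authors' code).
import torch
import torch.nn as nn

class MotionGCNEncoder(nn.Module):
    """Graph-convolutional encoder over skeleton joints, applied per frame."""
    def __init__(self, num_joints=17, in_dim=2, hidden=128, adjacency=None):
        super().__init__()
        # Normalized joint adjacency; identity is a placeholder assumption.
        A = adjacency if adjacency is not None else torch.eye(num_joints)
        self.register_buffer("A", A)
        self.proj = nn.Linear(in_dim, hidden)
        self.out = nn.Linear(num_joints * hidden, hidden)

    def forward(self, pose):  # pose: (batch, frames, joints, in_dim)
        # Graph convolution: mix joint features along the skeleton adjacency.
        h = torch.relu(torch.einsum("jk,btkd->btjd", self.A, self.proj(pose)))
        return self.out(h.flatten(2))  # (batch, frames, hidden) motion features

class DrumTrackDecoder(nn.Module):
    """Transformer decoder that cross-attends to motion features to emit drum tokens."""
    def __init__(self, vocab_size=512, hidden=128, layers=4, heads=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        layer = nn.TransformerDecoderLayer(hidden, heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, layers)
        self.head = nn.Linear(hidden, vocab_size)

    def forward(self, drum_tokens, motion_feats):  # drum_tokens: (batch, seq)
        tgt = self.embed(drum_tokens)
        mask = nn.Transformer.generate_square_subsequent_mask(tgt.size(1)).to(tgt.device)
        # Self-attention over drum tokens plus cross-attention to motion features.
        h = self.decoder(tgt, motion_feats, tgt_mask=mask)
        return self.head(h)  # next-token logits for the drum-track sequence
```

Under these assumptions, the decoder would be trained with teacher forcing on paired (motion, drum-token) sequences, and the remaining instrument tracks would then be filled in by the BERT-like completion model described above.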