Dance-driven music generation aims to generate musical pieces conditioned on dance videos. Previous work has focused on monophonic or raw-audio generation, while the multi-instrument scenario remains under-explored. Dance-driven multi-instrument music (MIDI) generation poses two challenges: 1) there is no publicly available paired dataset of multi-instrument MIDI and video, and 2) the correlation between music and video is weak. To tackle these challenges, we build the first paired dataset of multi-instrument MIDI and dance (D2MIDI). Based on this dataset, we introduce Dance2MIDI, a multi-instrument MIDI generation framework conditioned on dance video. Specifically, 1) to model the correlation between music and dance, we encode the dance motion using a graph convolutional network (GCN), and 2) to generate harmonious and coherent music, we employ a Transformer to decode the MIDI sequence. We evaluate the music generated by our framework trained on the D2MIDI dataset and demonstrate that our method outperforms existing methods. The data and code are available at https://github.com/Dance2MIDI/Dance2MIDI