Dance-driven music generation aims to generate musical pieces conditioned on dance videos. Previous works focus on monophonic or raw audio generation, while the multi-instrument scenario remains under-explored. Dance-driven multi-instrument music (MIDI) generation faces two challenges: 1) no multi-instrument MIDI and video paired dataset is publicly available, and 2) the correlation between music and video is weak. To tackle these challenges, we build the first multi-instrument MIDI and dance paired dataset (D2MIDI). Based on this dataset, we introduce Dance2MIDI, a multi-instrument MIDI generation framework conditioned on dance video. Specifically, 1) to model the correlation between music and dance, we encode the dance motion using a graph convolutional network (GCN), and 2) to generate harmonious and coherent music, we employ a Transformer to decode the MIDI sequence. We evaluate the music generated by our framework trained on the D2MIDI dataset and demonstrate that our method outperforms existing methods. The data and code are available on GitHub.
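To make the motion-encoding idea concrete, the following is a minimal sketch of a single graph convolution applied to per-frame skeleton keypoints. It is not the Dance2MIDI implementation: the five-joint skeleton, the edge list, the feature sizes, and the function names `normalized_adjacency` and `graph_conv` are all assumptions chosen for illustration, and the weights are random stand-ins for learned parameters.

```python
import numpy as np

# Hypothetical 5-joint skeleton (an assumption, not the paper's joint layout):
# 0 = head, 1 = torso, 2 = left arm, 3 = right arm, 4 = hips
EDGES = [(0, 1), (1, 2), (1, 3), (1, 4)]
NUM_JOINTS = 5


def normalized_adjacency(edges, num_joints):
    """Return D^{-1/2} (A + I) D^{-1/2} for an undirected skeleton graph."""
    a = np.eye(num_joints)  # self-loops keep each joint's own features
    for i, j in edges:
        a[i, j] = 1.0
        a[j, i] = 1.0
    deg_inv_sqrt = 1.0 / np.sqrt(a.sum(axis=1))
    return a * deg_inv_sqrt[:, None] * deg_inv_sqrt[None, :]


def graph_conv(x, a_hat, weight):
    """One graph convolution followed by ReLU.

    x:      (frames, joints, in_dim) joint coordinates per video frame
    a_hat:  (joints, joints) normalized adjacency
    weight: (in_dim, out_dim) projection, learned in a real model
    """
    # Average each joint's features with those of its skeletal neighbours.
    mixed = np.einsum("ij,tjc->tic", a_hat, x)
    return np.maximum(mixed @ weight, 0.0)


rng = np.random.default_rng(0)
poses = rng.normal(size=(8, NUM_JOINTS, 2))  # 8 frames of 2D keypoints
weight = rng.normal(size=(2, 16))            # random stand-in for learned weights
a_hat = normalized_adjacency(EDGES, NUM_JOINTS)
features = graph_conv(poses, a_hat, weight)  # shape (8, 5, 16)
```

In a full model, per-frame features of this kind would be passed to a sequence decoder, such as the Transformer named above, which emits MIDI tokens.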