Dance-driven music generation aims to generate musical pieces conditioned on dance videos. Previous work focuses on monophonic or raw audio generation, while the multi-instrument scenario is under-explored. The challenges of dance-driven multi-instrument music (MIDI) generation are two-fold: 1) the lack of a publicly available paired dataset of multi-instrument MIDI and video, and 2) the weak correlation between music and video. To tackle these challenges, we build the first paired dataset of multi-instrument MIDI and dance (D2MIDI). Based on this dataset, we introduce a multi-instrument MIDI generation framework (Dance2MIDI) conditioned on dance video. Specifically, 1) to model the correlation between music and dance, we encode the dance motion using a graph convolutional network (GCN), and 2) to generate harmonious and coherent music, we employ a Transformer to decode the MIDI sequence. We evaluate the music generated by our framework trained on the D2MIDI dataset and demonstrate that our method outperforms existing methods. The data and code are available on GitHub.
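To make the motion-encoding step concrete, the following is a minimal sketch of a single graph convolution over a pose skeleton, where joints are nodes and bones are edges. It assumes the widely used symmetric normalization with self-loops, ReLU(D^{-1/2}(A + I)D^{-1/2} X W); the abstract does not state which GCN variant Dance2MIDI uses, so this may differ from the actual encoder. The 3-joint skeleton, the coordinates, the identity weights, and all function names are hypothetical and chosen only for illustration.

```python
import math


def normalized_adjacency(num_joints, bones):
    """Return D^{-1/2} (A + I) D^{-1/2} for an undirected skeleton graph."""
    a = [[0.0] * num_joints for _ in range(num_joints)]
    for i in range(num_joints):
        a[i][i] = 1.0  # self-loop so each joint keeps its own features
    for i, j in bones:
        a[i][j] = 1.0
        a[j][i] = 1.0
    deg = [sum(row) for row in a]
    return [
        [a[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(num_joints)]
        for i in range(num_joints)
    ]


def matmul(x, y):
    """Multiply two matrices stored as lists of rows."""
    return [
        [sum(x[i][k] * y[k][j] for k in range(len(y))) for j in range(len(y[0]))]
        for i in range(len(x))
    ]


def gcn_layer(a_hat, features, weight):
    """One graph convolution: ReLU(a_hat @ features @ weight)."""
    h = matmul(matmul(a_hat, features), weight)
    return [[max(0.0, v) for v in row] for row in h]


# Hypothetical 3-joint chain 0 - 1 - 2 with 2D coordinates per joint.
BONES = [(0, 1), (1, 2)]
pose = [[0.0, 1.0], [0.0, 0.5], [0.0, 0.0]]
weight = [[1.0, 0.0], [0.0, 1.0]]  # identity weights, for illustration only

a_hat = normalized_adjacency(3, BONES)
encoded = gcn_layer(a_hat, pose, weight)  # one feature vector per joint
```

In a full pipeline, per-frame joint features of this kind would be stacked over time and passed as the conditioning sequence to the Transformer decoder that emits MIDI tokens; that part is omitted here because the abstract gives no detail on the token vocabulary or decoder configuration.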