Neural models are among the most popular approaches to music generation, yet there are no standard large datasets tailored for learning music directly from game data. To address this gap, we introduce NES-VMDB, a dataset containing 98,940 gameplay videos from 389 NES games, each paired with its original soundtrack in symbolic format (MIDI). NES-VMDB is built upon the Nintendo Entertainment System Music Database (NES-MDB), which contains 5,278 music pieces from 397 NES games. We collect long-play videos for 389 of the games in NES-MDB, slice them into 15-second clips, and extract the audio from each clip. We then apply an audio fingerprinting algorithm (similar to Shazam) to automatically identify the corresponding piece in NES-MDB. Additionally, we introduce a baseline method based on the Controllable Music Transformer (CMT) to generate NES music conditioned on gameplay clips. We evaluate this approach with objective metrics, and the results show that the conditional CMT improves musical structural quality compared to its unconditional counterpart. Moreover, we use a neural classifier to predict the game genre of the generated pieces. The results show that the CMT generator can learn correlations between gameplay videos and game genres, but further research is needed to achieve human-level performance.
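The clip-slicing and piece-identification steps described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `clip_windows` and `best_match` are hypothetical, the abstract does not say whether clips overlap or whether a trailing partial clip is kept (the sketch assumes non-overlapping clips and drops the remainder by default), and the matcher is a toy set-overlap vote rather than a real Shazam-style fingerprinting algorithm, which would also check the time alignment of matching hashes.

```python
from typing import Dict, List, Optional, Set, Tuple

CLIP_SECONDS = 15.0  # clip length stated in the abstract


def clip_windows(duration: float,
                 clip_len: float = CLIP_SECONDS,
                 keep_remainder: bool = False) -> List[Tuple[float, float]]:
    """Return (start, end) times, in seconds, of consecutive clips.

    Assumption: clips do not overlap; a trailing partial clip is
    dropped unless keep_remainder is True.
    """
    if duration < 0 or clip_len <= 0:
        raise ValueError("duration must be >= 0 and clip_len must be > 0")
    n_full = int(duration // clip_len)
    windows = [(i * clip_len, (i + 1) * clip_len) for i in range(n_full)]
    if keep_remainder and duration - n_full * clip_len > 0:
        windows.append((n_full * clip_len, duration))
    return windows


def best_match(clip_hashes: Set[int],
               database: Dict[str, Set[int]],
               min_votes: int = 1) -> Optional[str]:
    """Toy identification: pick the piece sharing the most hashes.

    Assumption: each piece and each clip is already reduced to a set of
    integer fingerprint hashes; how those hashes are computed is out of
    scope for this sketch.
    """
    best_id: Optional[str] = None
    best_votes = 0
    for piece_id, piece_hashes in database.items():
        votes = len(clip_hashes & piece_hashes)
        if votes > best_votes:
            best_id, best_votes = piece_id, votes
    return best_id if best_votes >= min_votes else None
```

For example, `clip_windows(50)` yields three 15-second windows and discards the final 5 seconds, while `best_match` returns `None` when a clip shares no hashes with any piece, which corresponds to a clip with no identifiable soundtrack.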