Drum source separation has long been held back by limited data availability, which has hindered the adoption of the deep learning methods that have proven successful in related audio applications. In this manuscript, we introduce StemGMD, a large-scale audio dataset of isolated single-instrument drum stems. Each audio clip is synthesized from MIDI recordings of expressive drum performances using ten real-sounding acoustic drum kits. Totaling 1224 hours, StemGMD is the largest audio dataset of drums to date and the first to comprise isolated audio clips for every instrument in a canonical nine-piece drum kit. We leverage StemGMD to develop LarsNet, a novel deep drum source separation model. Through a bank of dedicated U-Nets, LarsNet can separate five stems from a stereo drum mixture faster than real-time and is shown to significantly outperform state-of-the-art nonnegative spectro-temporal factorization methods.