Generating realistic drum audio directly from symbolic representations is a challenging task at the intersection of music perception and machine learning. We propose a system that transforms an expressive drum grid (a time-aligned MIDI representation with microtiming and velocity information) into drum audio by predicting the discrete codes of a neural audio codec. Our approach uses a Transformer-based model to map the drum grid to a sequence of codec tokens, which a pre-trained codec decoder then converts to a waveform. We experiment with several state-of-the-art neural codecs (EnCodec, DAC, and X-Codec) to assess how the choice of audio representation affects the quality of the generated drums. The system is trained and evaluated on the Expanded Groove MIDI Dataset (E-GMD), a large collection of human drum performances with paired MIDI and audio. We evaluate the fidelity and musical alignment of the generated audio using objective metrics. Overall, our results establish codec-token prediction as an effective route for drum grid-to-audio generation and provide practical insights into selecting audio tokenizers for percussive synthesis.
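To make the input representation concrete, the following is a minimal sketch of how an expressive drum grid could be built from a list of drum hits. The abstract does not specify the grid layout, so everything here is an assumption for illustration: the `Hit` type, the `hits_to_grid` helper, the three-instrument kit, the 16th-note resolution, and the per-cell triple (hit flag, normalized velocity, microtiming offset in fractions of a step).

```python
from dataclasses import dataclass


@dataclass
class Hit:
    time_s: float    # onset time in seconds
    instrument: int  # index into a small, hypothetical kit: 0=kick, 1=snare, 2=hi-hat
    velocity: int    # MIDI velocity, 1..127


def hits_to_grid(hits, bpm=120.0, steps_per_beat=4, n_steps=16, n_instruments=3):
    """Quantize hits to a fixed grid, keeping velocity and microtiming per cell.

    Each cell holds (hit, velocity, offset): hit is 0 or 1, velocity is
    scaled to [0, 1], and offset is the deviation from the grid line in
    fractions of a step, in [-0.5, 0.5).
    """
    step_s = 60.0 / bpm / steps_per_beat  # duration of one grid step in seconds
    grid = [[(0, 0.0, 0.0) for _ in range(n_instruments)] for _ in range(n_steps)]
    for h in hits:
        pos = h.time_s / step_s   # fractional position on the grid
        step = int(pos + 0.5)     # nearest grid step (pos is non-negative)
        if not 0 <= step < n_steps:
            continue              # drop hits that fall outside the grid
        vel = h.velocity / 127.0
        # If two hits land in the same cell, keep the louder one.
        if vel > grid[step][h.instrument][1]:
            grid[step][h.instrument] = (1, vel, pos - step)
    return grid


# Usage: one bar at 120 BPM with a slightly late snare and a slightly early hi-hat.
hits = [
    Hit(time_s=0.00, instrument=0, velocity=100),  # kick exactly on step 0
    Hit(time_s=0.51, instrument=1, velocity=90),   # snare just after step 4
    Hit(time_s=0.24, instrument=2, velocity=60),   # hi-hat just before step 2
]
grid = hits_to_grid(hits)
```

In a system like the one described, a grid of this kind would be the conditioning input to the Transformer, whose output tokens are then passed to the codec decoder; that stage is not sketched here because it depends on the specific codec.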