Although a variety of transformers have been proposed for symbolic music generation in recent years, there has been little comprehensive study of how specific design choices affect the quality of the generated music. In this work, we systematically compare datasets, model architectures, model sizes, and training strategies for the task of symbolic piano music generation. To support model development and evaluation, we examine a range of quantitative metrics and analyze how well they correlate with human judgments collected through listening studies. Our best-performing model, a 950M-parameter transformer trained on 80K MIDI files spanning diverse genres, produces outputs that are often rated as human-composed in a Turing-style listening survey.