Music generation poses challenging demands on large language models. The symbolic structure of music often combines vertical harmony with horizontal counterpoint, calling for various adaptations and enhancements to large-scale Transformers. However, existing works share three major drawbacks: 1) their tokenization requires domain-specific annotations, such as bars and beats, that are typically missing from raw MIDI data; 2) the isolated impact of improved token embedding methods is rarely examined in the absence of such annotations; and 3) works that address these drawbacks, such as MuseNet, lack reproducibility. To tackle these limitations, we develop a MIDI-based music generation framework inspired by MuseNet and empirically study two structural embeddings that do not rely on domain-specific annotations. We provide metrics and insights that can guide the choice of a suitable encoding for deployment, and we verify that different embedding configurations can selectively improve specific musical aspects. By releasing open-source implementations via HuggingFace, our findings shed light on leveraging large language models toward practical and reproducible music generation.
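To make the tokenization claim concrete, the following is a minimal, illustrative sketch (not the paper's exact scheme) of a MuseNet-style event tokenization that operates on raw MIDI note data alone, with no bar or beat annotations; the function name, token vocabulary, and quantization parameters are hypothetical choices for demonstration.

```python
# Illustrative sketch: annotation-free event tokenization of raw MIDI notes.
# Notes are (start_sec, end_sec, pitch, velocity) tuples; no bar/beat labels needed.

def tokenize_notes(notes, time_step=0.01, max_shift_steps=100):
    """Convert raw notes into VELOCITY / NOTE_ON / NOTE_OFF / TIME_SHIFT tokens."""
    # Flatten notes into timestamped on/off events; NOTE_OFF sorts before NOTE_ON
    # at equal times via the priority field.
    events = []
    for start, end, pitch, velocity in notes:
        events.append((start, 1, f"VELOCITY_{velocity // 4}"))  # coarse velocity bin
        events.append((start, 2, f"NOTE_ON_{pitch}"))
        events.append((end, 0, f"NOTE_OFF_{pitch}"))
    events.sort()

    tokens, current_time = [], 0.0
    for time, _, token in events:
        # Encode the gap since the previous event as one or more TIME_SHIFT tokens.
        steps = round((time - current_time) / time_step)
        while steps > 0:
            shift = min(steps, max_shift_steps)
            tokens.append(f"TIME_SHIFT_{shift}")
            steps -= shift
        current_time = time
        tokens.append(token)
    return tokens

# Example: a C major triad held for half a second.
notes = [(0.0, 0.5, 60, 80), (0.0, 0.5, 64, 80), (0.0, 0.5, 67, 80)]
print(tokenize_notes(notes)[:8])
```

Such a purely event-based vocabulary is what allows structural information to be injected only through the embedding layer, which is the design space the abstract refers to.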