The automated creation of accurate musical notation from an expressive human performance is a fundamental task in computational musicology. To this end, we present an end-to-end deep learning approach that constructs detailed musical scores directly from real-world piano performance-MIDI files. We introduce a modern transformer-based architecture with a novel tokenized representation for symbolic music data. Framing the task as sequence-to-sequence translation rather than note-wise classification reduces alignment requirements and annotation costs, while allowing the prediction of more concise and accurate notation. To serialize symbolic music data, we design a custom tokenization stage based on compound tokens that carefully quantizes continuous values. This technique preserves more score information while reducing sequence lengths by $3.5\times$ compared to prior approaches. Using the transformer backbone, our method demonstrates better understanding of note values, rhythmic structure, and details such as staff assignment. When evaluated end-to-end using transcription metrics such as MUSTER, we achieve significant improvements over previous deep learning approaches and complex HMM-based state-of-the-art pipelines. Our method is also the first to directly predict notational details like trill marks or stem direction from performance data. Code and models are available at https://github.com/TimFelixBeyer/MIDI2ScoreTransformer.
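To make the compound-token idea concrete, the following is a minimal illustrative sketch (not the paper's actual tokenizer): each note becomes a single token that bundles several quantized attributes, rather than a run of separate event tokens. All names, bin counts, and ranges here are hypothetical choices for illustration.

```python
# Sketch of a compound-token encoding with quantization of continuous values.
# Bin counts and value ranges below are illustrative assumptions, not the
# paper's actual configuration.
from dataclasses import dataclass

N_ONSET_BINS = 48  # hypothetical: onset position within a bar, quantized
N_DUR_BINS = 32    # hypothetical: number of note-duration bins


def quantize(value: float, lo: float, hi: float, n_bins: int) -> int:
    """Map a continuous value in [lo, hi] to an integer bin index."""
    value = min(max(value, lo), hi)  # clip to the representable range
    return round((value - lo) / (hi - lo) * (n_bins - 1))


@dataclass(frozen=True)
class CompoundToken:
    pitch: int  # MIDI pitch 0-127, already discrete
    onset: int  # quantized onset bin within the bar
    dur: int    # quantized duration bin


def tokenize_note(pitch: int, onset_beats: float, dur_beats: float,
                  bar_len: float = 4.0) -> CompoundToken:
    """Turn one performed note into one compound token."""
    return CompoundToken(
        pitch=pitch,
        onset=quantize(onset_beats % bar_len, 0.0, bar_len, N_ONSET_BINS),
        dur=quantize(dur_beats, 0.0, 8.0, N_DUR_BINS),
    )


# One note -> one compound token, vs. several tokens per note in flat
# event-based schemes, which is where sequence-length savings come from.
tok = tokenize_note(pitch=60, onset_beats=1.5, dur_beats=0.5)
```

Because all attributes of a note are emitted jointly, the sequence length scales with the number of notes rather than the number of attribute events, which is the intuition behind the reported $3.5\times$ reduction.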