Rhythm transcription is a key subtask of notation-level Automatic Music Transcription (AMT). While deep learning models have been extensively used for detecting the metrical grid in audio and MIDI performances, beat-based rhythm quantization remains largely unexplored. In this work, we introduce a novel deep learning approach for quantizing MIDI performances using a priori beat information. Our method leverages the transformer architecture to effectively process synchronized score and performance data for training a quantization model. Key components of our approach include dataset preparation, a beat-based pre-quantization method to align performance and score times within a unified framework, and a MIDI tokenizer tailored for this task. We adapt a transformer model based on the T5 architecture to meet the specific requirements of rhythm quantization. The model is evaluated using a set of score-level metrics designed for objective assessment of quantization performance. Through systematic evaluation, we optimize both data representation and model architecture. Additionally, we apply performance and score augmentations, such as transposition, note deletion, and performance-side time jitter, to enhance the model's robustness. Finally, a qualitative analysis compares our model's quantization performance against state-of-the-art probabilistic and deep-learning models on various example pieces. Our model achieves an onset F1-score of 97.3% and a note value accuracy of 83.3% on the ASAP dataset. It generalizes well across time signatures, including those not seen during training, and produces readable score output. Fine-tuning on instrument-specific datasets further improves performance by capturing characteristic rhythmic and melodic patterns. This work contributes a robust and flexible framework for beat-based MIDI quantization using transformer models.
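The abstract does not spell out how the beat-based pre-quantization works. As a rough illustration only, the sketch below shows one plausible way to use a priori beat annotations to move performance onsets from seconds into beat units before any model sees them. The function names (`seconds_to_beats`, `snap_to_grid`), the linear interpolation between beats, the grid of 12 subdivisions per beat, and the example timings are all assumptions made for this sketch and are not taken from the paper.

```python
from bisect import bisect_right
from fractions import Fraction


def seconds_to_beats(t, beat_times):
    """Map a time in seconds to a fractional beat position by linear
    interpolation between the two neighbouring annotated beats."""
    if len(beat_times) < 2:
        raise ValueError("need at least two beat annotations")
    # Index of the beat interval containing t, clamped so that times outside
    # the annotated range are extrapolated from the first or last interval.
    i = bisect_right(beat_times, t) - 1
    i = max(0, min(i, len(beat_times) - 2))
    start, end = beat_times[i], beat_times[i + 1]
    return i + (t - start) / (end - start)


def snap_to_grid(beat_pos, subdivisions=12):
    """Round a beat position to the nearest 1/subdivisions of a beat.

    The default of 12 is an assumption chosen so that both duple and
    triple subdivisions fall on the grid.
    """
    return Fraction(round(beat_pos * subdivisions), subdivisions)


# Hypothetical performance: the beat slows from 0.5 s to 0.6 s per beat.
beat_times = [0.0, 0.5, 1.0, 1.6, 2.2]
# Hypothetical note onsets in seconds, played slightly off the grid.
onsets = [0.02, 0.26, 0.74, 1.31, 1.58]

quantized = [snap_to_grid(seconds_to_beats(t, beat_times)) for t in onsets]
print(quantized)
```

Expressing onsets in beat units makes them independent of local tempo, so performance and score times can be compared on the same axis. A learned model such as the one described above would replace the naive `snap_to_grid` rounding step.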