Optical Music Recognition (OMR) is an important technology in music and has a long research history. Previous OMR approaches are usually based on CNNs for image understanding and RNNs for music symbol classification. In this paper, we propose TrOMR, a transformer-based approach to end-to-end polyphonic OMR with strong global perceptual capability. We also introduce a novel consistency loss function and a data annotation approach that improve recognition accuracy on complex music scores. Extensive experiments demonstrate that TrOMR outperforms current OMR methods, especially in real-world scenarios. We also develop a TrOMR system and build a camera-scene dataset of full-page music scores captured in real-world conditions. The code and datasets will be made available for reproducibility.