State-of-the-art end-to-end Optical Music Recognition (OMR) has, to date, relied primarily on monophonic transcription techniques. Complex score layouts, such as polyphony, are typically handled by resorting to simplifications or specific adaptations. Although effective, these workarounds come with scalability challenges and other limitations. This paper presents the Sheet Music Transformer, the first end-to-end OMR model designed to transcribe complex musical scores without relying solely on monophonic strategies. The model uses a Transformer-based image-to-sequence framework that predicts score transcriptions in a standard digital music encoding format from input images. We evaluate it on two polyphonic music datasets, where it handles these intricate musical structures effectively. The experimental results show that the model is not only competent but also outperforms state-of-the-art methods, thereby advancing end-to-end OMR transcription.