Most recent progress in Optical Music Recognition (OMR) has come from deep learning methods, especially models that follow the end-to-end paradigm: they read an input image and produce a linear sequence of tokens. Unfortunately, many music scores, especially piano music, cannot easily be converted to a linear sequence. This has led OMR researchers to use custom linearized encodings instead of broadly accepted structured formats for music notation. The diversity of these encodings makes it difficult to compare the performance of OMR systems directly. To bring recent OMR model progress closer to useful results: (a) We define a sequential format called Linearized MusicXML, which allows an end-to-end model to be trained directly while staying closely aligned and compatible with the industry-standard MusicXML format. (b) We create a dev set and a test set for benchmarking typeset OMR, with MusicXML ground truth based on the OpenScore Lieder corpus. They contain 1,438 and 1,493 pianoform systems, respectively, each with an image from IMSLP. (c) We train and fine-tune an end-to-end model to serve as a baseline on the dataset and use the TEDn metric to evaluate it. We also evaluate our model on the recently published synthetic pianoform dataset GrandStaff, where it surpasses the state-of-the-art results.
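As a rough illustration of what linearizing a structured notation format can look like, the sketch below flattens a small MusicXML-style fragment into a token sequence by depth-first traversal and then rebuilds the tree from those tokens. It is not the Linearized MusicXML format itself, whose token vocabulary is not specified in this abstract; the function names `linearize` and `delinearize` and the one-token-per-tag scheme are assumptions made purely for illustration, and attributes and tail text are ignored.

```python
import xml.etree.ElementTree as ET


def linearize(element):
    """Flatten an XML element into tokens by depth-first traversal.

    Hypothetical scheme: one token per opening tag, per text value,
    and per closing tag. Attributes and tail text are ignored.
    """
    tokens = [f"<{element.tag}>"]
    text = (element.text or "").strip()
    if text:
        tokens.append(text)
    for child in element:
        tokens.extend(linearize(child))
    tokens.append(f"</{element.tag}>")
    return tokens


def delinearize(tokens):
    """Rebuild the element tree from a well-formed token sequence using a stack."""
    stack = []
    root = None
    for token in tokens:
        if token.startswith("</"):
            # Closing tag: the last element popped is the root.
            root = stack.pop()
        elif token.startswith("<"):
            node = ET.Element(token[1:-1])
            if stack:
                stack[-1].append(node)
            stack.append(node)
        else:
            # Plain token: text content of the currently open element.
            stack[-1].text = token
    return root


# A single half note on C4, written as a MusicXML-style fragment.
NOTE_XML = (
    "<note><pitch><step>C</step><octave>4</octave></pitch>"
    "<duration>2</duration><type>half</type></note>"
)

tokens = linearize(ET.fromstring(NOTE_XML))
restored = ET.tostring(delinearize(tokens), encoding="unicode")
```

Because the token sequence mirrors the tree structure one-to-one, the round trip restores the original fragment, which is the kind of compatibility with a structured format that a linearized encoding aims for.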