The digitization of vocal music scores presents unique challenges that go beyond traditional Optical Music Recognition (OMR) and Optical Character Recognition (OCR), as it necessitates preserving the critical alignment between music notation and lyrics. This alignment is essential for proper interpretation and processing in practical applications. This paper introduces and formalizes, for the first time, the Aligned Music Notation and Lyrics Transcription (AMNLT) challenge, which addresses the complete transcription of vocal scores by jointly considering music symbols, lyrics, and their synchronization. We analyze different approaches to address this challenge, ranging from traditional divide-and-conquer methods that handle music and lyrics separately, to novel end-to-end solutions including direct transcription, unfolding mechanisms, and language modeling. To evaluate these methods, we introduce four datasets of Gregorian chants, comprising both real and synthetic sources, along with custom metrics specifically designed to assess both transcription and alignment accuracy. Our experimental results demonstrate that end-to-end approaches generally outperform heuristic methods in the alignment challenge, with language models showing particular promise in scenarios where sufficient training data is available. This work establishes the first comprehensive framework for AMNLT, providing both theoretical foundations and practical solutions for preserving and digitizing vocal music heritage.