The digitization of musical scores plays a crucial role in their preservation and accessibility, yet information retrieval still depends mainly on metadata searches, such as by title or composer. Content-based search in music score images remains underexplored compared to text documents, despite its potential value for musicians, musicologists, and educators. This work contributes to the field by first studying which characteristics of a score are most relevant for search and by defining a systematic method to build query datasets from any annotated corpus. We also consider diverse methods for content-based search on music score images, ranging from transcription-based approaches relying on Optical Music Recognition (OMR), to a transcription-free Transformer model trained to recognize queries directly from score images, to a text-prompted Large Language Model. Our experiments evaluate these models on four corpora that differ in dataset size, image quality, and typesetting mechanisms. Overall, each method excels under different conditions: OMR-based pipelines achieve higher in-domain retrieval performance, whereas transcription-free models handle domain variability more effectively.