The digitization of musical scores plays a crucial role in their preservation and accessibility, yet information retrieval still depends mainly on metadata searches, such as by title or composer. Content-based search in music score images remains underexplored compared to search in text documents, despite its potential value for musicians, musicologists, and educators. This work contributes to the field by first studying which characteristics of a score are most relevant for search and by defining a systematic method to build query datasets from any annotated corpus. We also consider diverse methods for content-based search on music score images: transcription-based approaches relying on Optical Music Recognition (OMR), a transcription-free Transformer model trained to recognize queries directly from score images, and a text-prompted Large Language Model. Our experiments evaluate these models on four corpora that differ in dataset size, image quality, and typesetting mechanisms. Overall, each method excels under different conditions: OMR-based pipelines achieve higher in-domain retrieval performance, whereas transcription-free models handle domain variability more effectively.