Music auto-tagging is essential for organizing and discovering music in extensive digital libraries. While foundation models achieve exceptional performance in this domain, their outputs often lack interpretability, limiting trust and usability for researchers and end-users alike. In this work, we present an interpretable framework for music auto-tagging that leverages groups of musically meaningful multimodal features, derived from signal processing, deep learning, ontology engineering, and natural language processing. To enhance interpretability, we cluster features semantically and employ an expectation maximization algorithm, assigning distinct weights to each group based on its contribution to the tagging process. Our method achieves competitive tagging performance while offering a deeper understanding of the decision-making process, paving the way for more transparent and user-centric music tagging systems.
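As a rough illustration of how an expectation maximization procedure can assign one weight per feature group, the sketch below fits mixture weights over a set of fixed per-group predictors. It is not the formulation used in this work, which the abstract does not specify. The function name `em_group_weights`, the three-group toy matrix, and the assumption that each group has its own pre-trained model that outputs a probability for the correct tag are all hypothetical choices made for this example.

```python
import numpy as np


def em_group_weights(likelihoods, n_iter=100, tol=1e-8):
    """Estimate one mixture weight per feature group with EM.

    likelihoods[n, g] is the probability that the (hypothetical) model
    trained on feature group g assigns to the correct tag of example n.
    The per-group models are held fixed; only the weights are learned.
    """
    likelihoods = np.asarray(likelihoods, dtype=float)
    n_examples, n_groups = likelihoods.shape
    weights = np.full(n_groups, 1.0 / n_groups)  # start from uniform weights
    prev_log_lik = -np.inf

    for _ in range(n_iter):
        # E-step: responsibility of each group for each example.
        weighted = likelihoods * weights
        mixture = np.maximum(weighted.sum(axis=1, keepdims=True), 1e-300)
        responsibilities = weighted / mixture

        # M-step: new weight is the normalized total responsibility.
        weights = responsibilities.sum(axis=0)
        weights /= weights.sum()

        # Log-likelihood under the weights used in this E-step.
        log_lik = np.log(mixture).sum()
        if log_lik - prev_log_lik < tol:
            break
        prev_log_lik = log_lik

    return weights


# Toy data (assumed): rows are examples, columns are three feature groups.
toy = np.array([
    [0.9, 0.5, 0.2],
    [0.8, 0.4, 0.3],
    [0.7, 0.6, 0.1],
    [0.9, 0.3, 0.2],
])
weights = em_group_weights(toy)
print(weights)
```

In this toy setting the first group explains every example best, so its weight grows at the expense of the others. The learned weights can be read as a coarse indication of how much each feature group contributes to the tagging decision.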