In the age of music streaming platforms, automatic tagging of music audio has attracted significant attention, driving researchers to devise methods that improve performance metrics on standard datasets. Most recent approaches rely on deep neural networks, which, despite their impressive performance, are opaque: it is difficult to explain their output for a given input. While interpretability has been emphasized in other fields such as medicine, it has not received attention in music-related tasks. In this study, we explored the relevance of interpretability in the context of automatic music tagging. We constructed a workflow that incorporates three information extraction techniques: a) leveraging symbolic knowledge, b) utilizing auxiliary deep neural networks, and c) employing signal processing to extract perceptual features from audio files. These features were then used to train an interpretable machine-learning model for tag prediction. We conducted experiments on two datasets, MTG-Jamendo and GTZAN. Our method surpassed the baseline models on both and, in certain instances, was competitive with the current state of the art. We conclude that there are use cases where the value of interpretability outweighs the deterioration in performance.
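To make the idea of training an interpretable model on perceptual features concrete, the following is a minimal sketch of the signal-processing branch only. It is not the pipeline used in this study: the two features (zero-crossing rate and RMS energy), the synthetic sine-wave "tracks", the tag names `bass` and `bright`, and the one-rule decision stump are all illustrative assumptions, chosen so that the learned model can be read directly as a single threshold on a named feature.

```python
import math

def zero_crossing_rate(signal):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(signal, signal[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(signal) - 1)

def rms_energy(signal):
    """Root-mean-square amplitude of the signal."""
    return math.sqrt(sum(x * x for x in signal) / len(signal))

def extract_features(signal):
    """Map raw samples to a small dictionary of named, human-readable features."""
    return {
        "zero_crossing_rate": zero_crossing_rate(signal),
        "rms_energy": rms_energy(signal),
    }

def sine(freq_hz, amplitude, sample_rate=8000, n_samples=2000):
    """Synthetic stand-in for an audio clip (hypothetical data, not a real track)."""
    return [amplitude * math.sin(2 * math.pi * freq_hz * n / sample_rate)
            for n in range(n_samples)]

def fit_stump(rows, labels):
    """Fit a one-rule classifier for exactly two tags.

    Returns (feature, threshold, tag_if_above, tag_if_below), chosen to
    minimise the number of training errors.
    """
    tags = sorted(set(labels))
    best = None
    for feature in rows[0]:
        values = sorted(set(r[feature] for r in rows))
        # Candidate thresholds are midpoints between consecutive observed values.
        for lo, hi in zip(values, values[1:]):
            threshold = (lo + hi) / 2
            for above, below in ((tags[0], tags[1]), (tags[1], tags[0])):
                errors = sum(
                    1 for r, y in zip(rows, labels)
                    if (above if r[feature] > threshold else below) != y
                )
                if best is None or errors < best[0]:
                    best = (errors, feature, threshold, above, below)
    return best[1:]

def predict(stump, row):
    """Apply the learned rule to one feature dictionary."""
    feature, threshold, above, below = stump
    return above if row[feature] > threshold else below

# Hypothetical training set: low-pitched tones tagged "bass", high-pitched "bright".
# Amplitudes are mixed so that loudness alone does not separate the tags.
clips = [(60, 0.5, "bass"), (80, 0.9, "bass"), (100, 0.3, "bass"),
         (1500, 0.4, "bright"), (2000, 0.8, "bright"), (2500, 0.6, "bright")]
rows = [extract_features(sine(freq, amp)) for freq, amp, _ in clips]
labels = [tag for _, _, tag in clips]

stump = fit_stump(rows, labels)
feature, threshold, above, below = stump
print(f"rule: if {feature} > {threshold:.3f} then '{above}' else '{below}'")
print(predict(stump, extract_features(sine(70, 0.7))))
print(predict(stump, extract_features(sine(1800, 0.2))))
```

The printed rule is the whole model, which is the property that makes it interpretable: a prediction can be traced to one feature and one threshold. A realistic system would presumably use richer features and a more expressive interpretable model, at some cost in readability.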