Raga Identification is a popular research problem in Music Information Retrieval. The few studies that have explored this task employ a variety of approaches, such as signal processing, Machine Learning (ML) methods, and, more recently, Deep Learning (DL) based methods. However, a key question remains unanswered in all of these works: do these ML/DL methods learn and interpret Ragas in a manner similar to human experts? Moreover, a significant roadblock in this research is the scarcity of rich, labeled datasets, which such ML/DL based methods depend on. In this paper, we introduce "Prasarbharti Indian Music" version-1 (PIM-v1), a novel dataset comprising 191 hours of meticulously labeled Hindustani Classical Music (HCM) recordings, which, to the best of our knowledge, is the largest labeled dataset of HCM recordings. Our approach involves conducting ablation studies to identify a benchmark classification model for Automatic Raga Identification (ARI) using the PIM-v1 dataset. We achieve a chunk-wise f1-score of 0.89 on a subset of 12 Raga classes. Subsequently, we employ model explainability techniques to evaluate the classifier's predictions, aiming to ascertain whether they align with human understanding of Ragas or are driven by arbitrary patterns. We validate the correctness of the model's predictions by comparing the explanations produced by two ExAI models against human expert annotations. Finally, we analyze explanations for individual test examples to understand how the regions highlighted by the explanations contribute to the correct or incorrect predictions made by the model.