Audio signal segmentation is a key task for automatic audio indexing. It consists of detecting the boundaries of class-homogeneous segments in the signal. In many applications, explainable AI is vital for making machine-learning decisions transparent. In this paper, we propose an explainable multilabel segmentation model that simultaneously solves speech activity detection (SAD), music detection (MD), noise detection (ND), and overlapped speech detection (OSD). This proxy model uses non-negative matrix factorization (NMF) to map the embedding used for segmentation to the frequency domain. Experiments conducted on two datasets show performance similar to that of the pre-trained black-box model, together with strong explainability features. Specifically, the frequency bins used for the decision can be easily identified at both the segment level (local explanations) and the global level (class prototypes).
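To make the role of NMF more concrete, the following is a minimal sketch of a generic NMF with multiplicative updates applied to a toy non-negative "spectrogram". It is not the model proposed in the paper: the function name `nmf`, the matrices `V`, `W`, `H`, the rank, and the synthetic data are all assumptions chosen for illustration. The idea it shows is that each column of `W` is a spectral component over frequency bins, so the activations in `H` for a given frame indicate which frequency bins contribute most to that frame.

```python
import numpy as np


def nmf(V, rank, n_iter=200, eps=1e-9, seed=0):
    """Approximate a non-negative matrix V (freq_bins x frames) as W @ H.

    Uses the classic multiplicative updates for the Frobenius-norm objective.
    W (freq_bins x rank) holds spectral components; H (rank x frames) holds
    their non-negative activations over time.
    """
    rng = np.random.default_rng(seed)
    n_freq, n_frames = V.shape
    W = rng.random((n_freq, rank)) + eps
    H = rng.random((rank, n_frames)) + eps
    for _ in range(n_iter):
        # Update activations, then components; eps avoids division by zero.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H


# Toy data (hypothetical): 64 frequency bins, 100 frames, built from 4 components.
rng = np.random.default_rng(1)
V = rng.random((64, 4)) @ rng.random((4, 100))

W, H = nmf(V, rank=4)
rel_error = np.linalg.norm(V - W @ H) / np.linalg.norm(V)

# "Local" view for one frame: reconstruct its spectrum from the activations
# and list the frequency bins that carry the most energy.
frame = 10
local_spectrum = W @ H[:, frame]
top_bins = np.argsort(local_spectrum)[::-1][:5]

print(f"relative reconstruction error: {rel_error:.4f}")
print("most active frequency bins for frame", frame, ":", top_bins)
```

In this sketch, averaging the activations in `H` over all frames assigned to one class would give a rough global counterpart to the per-frame view, in the spirit of the class prototypes mentioned above; how the paper actually derives its prototypes is not specified here.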