Audio signal segmentation is a key task for automatic audio indexing. It consists of detecting the boundaries of class-homogeneous segments in the signal. In many applications, explainable AI is vital for making machine-learning decisions transparent. In this paper, we propose an explainable multilabel segmentation model that simultaneously performs speech activity detection (SAD), music detection (MD), noise detection (ND), and overlapped speech detection (OSD). This proxy model uses non-negative matrix factorization (NMF) to map the embedding used for segmentation to the frequency domain. Experiments conducted on two datasets show that it performs similarly to the pre-trained black-box model while offering strong explainability. Specifically, the frequency bins used for the decision can be easily identified at both the segment level (local explanations) and the global level (class prototypes).
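To illustrate how an NMF decomposition can tie a decision back to frequency bins, the following is a minimal sketch and not the implementation used in the paper, whose architecture is not described in this abstract. It assumes a plain NMF of a magnitude spectrogram fitted with Lee-Seung multiplicative updates, and a hypothetical vector of per-component class weights (`class_weights`) standing in for whatever classifier operates on the activations. The function names `nmf`, `local_relevance`, and `global_relevance` and the toy data are all illustrative assumptions.

```python
import numpy as np


def nmf(V, n_components, n_iter=200, eps=1e-10, seed=0):
    """Factor a non-negative matrix V (freq_bins x frames) as W @ H.

    Uses multiplicative updates for the Frobenius reconstruction loss.
    W holds spectral templates (freq_bins x components),
    H holds their activations over time (components x frames).
    """
    rng = np.random.default_rng(seed)
    n_bins, n_frames = V.shape
    W = rng.random((n_bins, n_components)) + eps
    H = rng.random((n_components, n_frames)) + eps
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H


def local_relevance(W, H, class_weights, start, stop):
    """Frequency profile for one segment: weight each component by its
    mean activation in frames [start, stop) and project through W."""
    mean_act = H[:, start:stop].mean(axis=1)
    return W @ (class_weights * mean_act)


def global_relevance(W, class_weights):
    """Frequency profile for a whole class (a prototype-like view):
    project the per-component weights through W."""
    return W @ class_weights


# Toy spectrogram: a low-band source in the first half of the signal,
# a high-band source in the second half, plus a little noise.
rng = np.random.default_rng(1)
n_bins, n_frames = 16, 40
templates = np.zeros((n_bins, 2))
templates[0:4, 0] = 1.0
templates[10:16, 1] = 1.0
acts = np.zeros((2, n_frames))
acts[0, :20] = 1.0
acts[1, 20:] = 1.0
V = templates @ acts + 0.01 * rng.random((n_bins, n_frames))

W, H = nmf(V, n_components=2)
rel_error = np.linalg.norm(V - W @ H) / np.linalg.norm(V)

# Hypothetical per-component weights for one class (an assumption,
# standing in for a classifier applied to the activations).
class_weights = np.array([1.0, 0.5])
prototype = global_relevance(W, class_weights)
segment_expl = local_relevance(W, H, class_weights, 0, 20)
```

Because `W` is non-negative and indexed by frequency bin, any non-negative weighting of the components yields a non-negative profile over frequency bins. In this sketch, that property is what makes both the segment-level and the class-level views readable as "which bins mattered".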