Despite the impressive performance of deep learning models across diverse tasks, their complexity poses challenges for interpretation. This challenge is particularly evident for audio signals, where conveying interpretations becomes inherently difficult. To address this issue, we introduce Listenable Maps for Audio Classifiers (L-MAC), a post-hoc interpretation method that generates faithful and listenable interpretations. L-MAC utilizes a decoder on top of a pretrained classifier to generate binary masks that highlight relevant portions of the input audio. We train the decoder with a loss function that maximizes the confidence of the classifier decision on the masked-in portion of the audio while minimizing the classifier's output probability on the masked-out portion. Quantitative evaluations on both in-domain and out-of-domain data demonstrate that L-MAC consistently produces more faithful interpretations than several gradient-based and masking-based methodologies. Furthermore, a user study confirms that, on average, users prefer the interpretations generated by the proposed technique.
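The training objective described above can be sketched as follows. This is a minimal illustrative example, not the paper's exact formulation: the hyperparameter `lam` balancing the two terms, the function names, and the cross-entropy-style form of the confidence term are all assumptions for illustration.

```python
import math

def softmax(logits):
    """Convert a list of logits into a probability distribution."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def lmac_style_loss(logits_masked_in, logits_masked_out, target, lam=1.0):
    """Illustrative masking loss in the spirit of L-MAC:
    - maximize the classifier's confidence in the target class
      for the masked-in audio (via a negative log-likelihood term),
    - minimize the classifier's probability for the target class
      on the masked-out audio (via a direct probability penalty).
    `lam` (assumed) weights the masked-out penalty."""
    p_in = softmax(logits_masked_in)[target]    # prob on masked-in input
    p_out = softmax(logits_masked_out)[target]  # prob on masked-out input
    return -math.log(p_in + 1e-12) + lam * p_out

# A good mask keeps the class evidence in the masked-in portion:
good = lmac_style_loss([5.0, 0.0], [0.0, 5.0], target=0)
# A bad mask leaves the evidence in the masked-out portion:
bad = lmac_style_loss([0.0, 5.0], [5.0, 0.0], target=0)
print(good < bad)
```

In practice the classifier producing these logits is frozen and only the mask-generating decoder receives gradients from this loss.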