Interpreting the decisions of deep learning models, including audio classifiers, is crucial for ensuring the transparency and trustworthiness of this technology. In this paper, we introduce LMAC-ZS (Listenable Maps for Audio Classifiers in the Zero-Shot context), which, to the best of our knowledge, is the first decoder-based post-hoc interpretation method for explaining the decisions of zero-shot audio classifiers. The proposed method uses a novel loss function that maximizes faithfulness to the original similarity between a given text-and-audio pair. We provide an extensive evaluation using the Contrastive Language-Audio Pretraining (CLAP) model to show that our interpreter remains faithful to the decisions in a zero-shot classification context. Moreover, we show qualitatively that our method produces meaningful explanations that correlate well with different text prompts.
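To make the zero-shot setting concrete, the following is a minimal sketch of how a contrastive language-audio model can classify audio by comparing one audio embedding against the embeddings of several text prompts. It is not the LMAC-ZS implementation: the embeddings are hand-written toy vectors rather than CLAP outputs, and `zero_shot_classify` and `similarity_gap` are hypothetical helper names. In particular, `similarity_gap` is only an illustrative stand-in for the idea of staying faithful to the original text-audio similarities; the actual loss function is not specified in this abstract.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so that dot products are cosine similarities."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)


def zero_shot_classify(audio_emb, text_embs):
    """Return (index of the best-matching prompt, cosine similarity per prompt)."""
    a = l2_normalize(audio_emb)
    t = l2_normalize(text_embs)
    sims = t @ a  # one cosine similarity per text prompt
    return int(np.argmax(sims)), sims


def similarity_gap(sims_original, sims_explained):
    """Hypothetical faithfulness proxy (an assumption, not the paper's loss):
    mean absolute change in similarity when the audio is replaced by its explanation."""
    return float(np.mean(np.abs(sims_original - sims_explained)))


# Toy embeddings standing in for encoder outputs (assumed values).
text_embs = np.eye(3, 4)                       # three text prompts, 4-dim space
audio_emb = np.array([0.1, 0.9, 0.2, 0.0])     # closest to prompt 1
masked_emb = np.array([0.0, 0.9, 0.1, 0.0])    # embedding of a hypothetical masked input

pred, sims = zero_shot_classify(audio_emb, text_embs)
_, sims_masked = zero_shot_classify(masked_emb, text_embs)
gap = similarity_gap(sims, sims_masked)
```

In this sketch a small `gap` means that the masked input yields nearly the same similarities to every prompt as the original audio, which is the intuition behind asking an explanation to remain faithful to the classifier's decision.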