Explainable AI (XAI) techniques have been widely used to help explain and understand the output of deep learning models in fields such as image classification and natural language processing. Interest in using XAI techniques to explain deep learning-based automatic speech recognition (ASR) is emerging, but there is not enough evidence on whether these explanations can be trusted. To address this, we adapt a state-of-the-art XAI technique from the image classification domain, Local Interpretable Model-Agnostic Explanations (LIME), to a model trained for a TIMIT-based phoneme recognition task. This simple task provides a controlled setting for evaluation and comes with expert-annotated ground truth against which the quality of explanations can be assessed. We find that a variant of LIME based on time-partitioned audio segments, which we propose in this paper, produces the most reliable explanations, containing the ground truth in its top three audio segments 96% of the time.