We address quality assessment for neural-network-based ASR by providing explanations that improve our understanding of the system and ultimately help build trust in it. Explaining transcriptions is more challenging than explaining simple classification labels, because judging their correctness is not straightforward and existing interpretable machine learning models do not handle transcriptions, which are variable-length sequences. We provide an explanation for an ASR transcription as a subset of audio frames that is both a minimal and a sufficient cause of the transcription. To do this, we adapt existing explainable AI (XAI) techniques from image classification: Statistical Fault Localisation (SFL) and Causal. Additionally, we use a version of Local Interpretable Model-Agnostic Explanations (LIME) adapted for ASR as a baseline in our experiments. We evaluate the quality of the explanations generated by the proposed techniques on three different ASR systems (Google API, the baseline model of Sphinx, and Deepspeech) and 100 audio samples from the Commonvoice dataset.
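To make the notion of a "minimal and sufficient" subset of frames concrete, the following is a small sketch of an SFL-style procedure on a toy example. It is not the method evaluated in this work. The function `toy_transcribe`, the Ochiai-style score, the random masking scheme, the exact-match comparison of transcriptions, and all other names are assumptions introduced purely for illustration.

```python
import math
import random


def toy_transcribe(frames):
    """Hypothetical stand-in for an ASR system (not a real API).

    Emits "hello" if frames 1 and 2 are audible and "world" if frame 5 is.
    """
    words = []
    if frames[1] != 0 and frames[2] != 0:
        words.append("hello")
    if frames[5] != 0:
        words.append("world")
    return " ".join(words)


def apply_mask(frames, keep):
    """Silence (zero out) every frame whose index is not in `keep`."""
    return [x if i in keep else 0 for i, x in enumerate(frames)]


def rank_frames(frames, transcribe, n_mutants=500, seed=0):
    """Rank frame indices with an Ochiai-style score (an assumed choice)."""
    rng = random.Random(seed)
    target = transcribe(frames)
    n = len(frames)
    kept_same = [0] * n    # frame kept, transcription unchanged
    kept_diff = [0] * n    # frame kept, transcription changed
    masked_same = [0] * n  # frame silenced, transcription unchanged
    for _ in range(n_mutants):
        # Each mutant keeps every frame independently with probability 0.5.
        keep = {i for i in range(n) if rng.random() < 0.5}
        same = transcribe(apply_mask(frames, keep)) == target
        for i in range(n):
            if i in keep and same:
                kept_same[i] += 1
            elif i in keep:
                kept_diff[i] += 1
            elif same:
                masked_same[i] += 1

    def score(i):
        denom = math.sqrt((kept_same[i] + masked_same[i])
                          * (kept_same[i] + kept_diff[i]))
        return kept_same[i] / denom if denom else 0.0

    return sorted(range(n), key=score, reverse=True)


def explain(frames, transcribe):
    """Return a sufficient frame subset from which no single frame can be dropped."""
    target = transcribe(frames)
    chosen = set()
    # Add frames in ranked order until they alone reproduce the transcription.
    for i in rank_frames(frames, transcribe):
        chosen.add(i)
        if transcribe(apply_mask(frames, chosen)) == target:
            break
    # Prune: drop any frame whose removal still reproduces the transcription.
    for i in sorted(chosen):
        if transcribe(apply_mask(frames, chosen - {i})) == target:
            chosen.discard(i)
    return sorted(chosen)


if __name__ == "__main__":
    print(explain([1] * 8, toy_transcribe))
```

In this toy setting the explanation consists of the frames the stand-in model actually depends on. With a real ASR system, exact string equality would likely be too strict, and a similarity threshold between transcriptions would be a more plausible criterion; the greedy pass followed by pruning guarantees only that no single frame can be removed, not that the subset is the smallest possible.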