Speech carries clinically relevant information about some diseases and can therefore potentially be used for health assessment. Recent work shows interest in applying deep learning algorithms, especially large pretrained speech models, to Automatic Speech Assessment. One question that has not been explored is how these models arrive at their outputs from their inputs. In this work, we train and compare two configurations of the Audio Spectrogram Transformer for Voice Disorder Detection and apply the attention rollout method to produce model relevance maps, which give the computed relevance of each spectrogram region to the model's predictions. We use these maps to analyse how the models make predictions under different conditions, and to show that the spread of attention decreases as a model is finetuned and that the model's attention concentrates on specific phoneme regions.
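To make the relevance maps concrete, the following is a minimal sketch of the attention rollout computation as it is commonly formulated: per-layer attention is averaged over heads, mixed with the identity to account for residual connections, re-normalised, and multiplied across layers. It is not the authors' implementation. The function names, the `num_special_tokens` parameter, and the frequency-major patch ordering used for reshaping are all assumptions made for illustration, and the attention matrices are assumed to have already been extracted from the model.

```python
import numpy as np


def attention_rollout(attentions, num_special_tokens=1):
    """Roll attention out across layers (illustrative sketch).

    attentions: list with one array per layer, each of shape
        (heads, tokens, tokens), with rows summing to 1.
    num_special_tokens: assumed number of leading non-patch tokens
        (e.g. a classification token); hypothetical parameter.
    Returns the relevance of each patch token to the first token.
    """
    num_tokens = attentions[0].shape[-1]
    identity = np.eye(num_tokens)
    rollout = identity.copy()
    for layer_attn in attentions:
        fused = layer_attn.mean(axis=0)            # average over heads
        fused = 0.5 * fused + 0.5 * identity       # account for residual path
        fused = fused / fused.sum(axis=-1, keepdims=True)  # keep rows normalised
        rollout = fused @ rollout                  # compose with earlier layers
    # Row 0 is the first (classification) token; drop the special tokens.
    return rollout[0, num_special_tokens:]


def to_relevance_map(patch_relevance, freq_patches, time_patches):
    """Reshape patch relevances onto the spectrogram patch grid.

    Assumes patches are ordered frequency-major; adjust if the model
    flattens its patch grid differently.
    """
    return patch_relevance.reshape(freq_patches, time_patches)


if __name__ == "__main__":
    # Random softmax attention stands in for real model attention here.
    rng = np.random.default_rng(0)
    logits = rng.normal(size=(4, 2, 13, 13))       # layers, heads, tokens, tokens
    attn = np.exp(logits)
    attn /= attn.sum(axis=-1, keepdims=True)
    relevance = attention_rollout(list(attn), num_special_tokens=1)
    print(to_relevance_map(relevance, freq_patches=3, time_patches=4).shape)
```

The resulting grid can be upsampled and overlaid on the input spectrogram to inspect which time-frequency regions the rolled-out attention favours.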