Neural networks have been used successfully for non-intrusive speech intelligibility prediction. Feature representations taken from intermediate layers of pre-trained self-supervised and weakly supervised models have recently proved particularly useful for this task. This work combines Whisper ASR decoder layer representations, used as neural network input features, with an exemplar-based, psychologically motivated model of human memory to predict human intelligibility ratings for hearing-aid users. The approach substantially outperforms an established intrusive HASPI baseline system, including on enhancement systems and listeners unseen in the training data, with a root mean squared error of 25.3 compared with 28.7 for the baseline (a relative reduction of roughly 12%).