Speech foundation models (SFMs) have been benchmarked on many speech processing tasks, often achieving state-of-the-art performance with minimal adaptation. However, the SFM paradigm has been explored far less for applications of interest to the speech perception community. In this paper, we present a systematic evaluation of 10 SFMs on one such application: speech intelligibility prediction. We focus on the non-intrusive setup of the Clarity Prediction Challenge 2 (CPC2), where the task is to predict the percentage of words correctly perceived by hearing-impaired listeners from speech-in-noise recordings. To approach the problem, we propose a simple method that learns a lightweight, specialized prediction head on top of frozen SFMs. Our results reveal statistically significant differences in performance across SFMs. Our method produced the winning submission in CPC2, demonstrating its promise for speech perception applications.
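The idea of training only a small prediction head on top of frozen SFM features can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: no real SFM is loaded (random arrays stand in for frozen frame-level features), the head is a single linear layer with a sigmoid output trained by plain gradient descent, and the names `fake_frozen_features`, `fit_head`, and `predict_percent` are hypothetical. It is not the head architecture or training setup used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
N_UTTS, FEAT_DIM = 64, 8  # hypothetical sizes, chosen only for illustration


def fake_frozen_features(offset):
    """Stand-in for a frozen SFM: returns (T, D) frame features for one recording."""
    n_frames = int(rng.integers(50, 200))  # recordings differ in length
    return offset + 0.1 * rng.normal(size=(n_frames, FEAT_DIM))


def mean_pool(frames):
    """Collapse the time axis so every recording yields one D-dimensional vector."""
    return frames.mean(axis=0)


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def fit_head(X, y, lr=0.5, l2=1e-3, steps=2000):
    """Train a linear + sigmoid head on pooled features; y is a fraction in [0, 1]."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(steps):
        p = sigmoid(X @ w + b)
        # Gradient of 0.5 * squared error with respect to the pre-sigmoid logit.
        grad_z = (p - y) * p * (1.0 - p)
        w -= lr * (X.T @ grad_z / n + l2 * w)
        b -= lr * grad_z.mean()
    return w, b


def predict_percent(X, w, b):
    """Map pooled features to a predicted percentage of correctly perceived words."""
    return 100.0 * sigmoid(X @ w + b)


# Synthetic stand-in data: the feature extractor is never updated, only the head.
offsets = rng.normal(size=(N_UTTS, FEAT_DIM))
w_true = 0.5 * rng.normal(size=FEAT_DIM)
X = np.stack([mean_pool(fake_frozen_features(o)) for o in offsets])
y_percent = 100.0 * sigmoid(offsets @ w_true)

w, b = fit_head(X, y_percent / 100.0)
preds = predict_percent(X, w, b)

rmse_head = float(np.sqrt(np.mean((preds - y_percent) ** 2)))
rmse_baseline = float(np.sqrt(np.mean((y_percent.mean() - y_percent) ** 2)))
print(f"RMSE head: {rmse_head:.2f}  RMSE mean-baseline: {rmse_baseline:.2f}")
```

The sigmoid output keeps predictions inside the 0 to 100 percent range, and because the feature extractor stays fixed, only `FEAT_DIM + 1` parameters are learned here. RMSE against a predict-the-mean baseline is used purely as an illustrative sanity check.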