This research introduces MOSA-Net+, an enhanced version of the multi-objective speech assessment model MOSA-Net that leverages acoustic features from Whisper, a large-scale weakly supervised model. We first investigate the effectiveness of Whisper in building a more robust speech assessment model. We then explore combining representations from Whisper and self-supervised learning (SSL) models. The experimental results show that Whisper's embedding features contribute to more accurate predictions, whereas combining the embedding features from Whisper and SSL models yields only marginal improvement. Compared with intrusive methods, MOSA-Net, and other SSL-based speech assessment models, MOSA-Net+ yields notable improvements in estimating subjective quality and intelligibility scores across all evaluation metrics on the Taiwan Mandarin Hearing In Noise Test - Quality & Intelligibility (TMHINT-QI) dataset. To further validate its robustness, MOSA-Net+ was tested in the noisy-and-enhanced track of the VoiceMOS Challenge 2023, where it achieved the top-ranked performance among nine systems.