This research introduces MOSA-Net+, an enhanced version of the multi-objective speech assessment model that uses acoustic features from Whisper, a large pre-trained weakly supervised model, to create embedding features. The first part of this study investigates how the embedding features of Whisper and of two self-supervised learning (SSL) models correlate with subjective quality and intelligibility scores. The second part evaluates the effectiveness of Whisper in building a more robust speech assessment model. The third part analyzes the possibility of combining representations from Whisper and SSL models when deploying MOSA-Net+. The experimental results reveal that Whisper's embedding features correlate more strongly with subjective quality and intelligibility than those of the SSL models, which contributes to the more accurate predictions achieved by MOSA-Net+. Moreover, combining the embedding features from Whisper and SSL models yields only marginal improvement. Compared with MOSA-Net and other SSL-based speech assessment models, MOSA-Net+ yields notable improvements in estimating subjective quality and intelligibility scores across all evaluation metrics. We further tested MOSA-Net+ on Track 3 of the VoiceMOS Challenge 2023, where it obtained the top-ranked performance.
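To make the notion of agreement with subjective scores concrete, the following is a minimal sketch of how predicted scores might be compared with human ratings. The abstract does not name its evaluation metrics; linear correlation, rank correlation, and mean squared error are assumed here because they are common choices in speech assessment work. The function names and the toy scores are hypothetical and are not results from the paper.

```python
from math import sqrt

def pearson(x, y):
    """Linear correlation coefficient between two equal-length score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)

def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    """Rank correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

def mse(x, y):
    """Mean squared error between predictions and reference scores."""
    return sum((a - b) ** 2 for a, b in zip(x, y)) / len(x)

if __name__ == "__main__":
    # Toy values for illustration only (assumed, not taken from the paper).
    human_scores = [1.0, 2.0, 3.0, 4.0, 5.0]
    predicted_scores = [1.2, 1.9, 3.4, 3.8, 4.9]
    print("linear correlation:", pearson(human_scores, predicted_scores))
    print("rank correlation:  ", spearman(human_scores, predicted_scores))
    print("mean squared error:", mse(human_scores, predicted_scores))
```

Under these assumptions, a higher linear or rank correlation and a lower mean squared error would indicate that a model's predictions track the subjective ratings more closely.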