This study introduces MOSA-Net+, an enhanced version of the multi-objective speech assessment model MOSA-Net, which derives its embedding features from the acoustic representations of Whisper, a large pre-trained weakly supervised model. First, we investigate how the embedding features of Whisper and of two self-supervised learning (SSL) models correlate with subjective quality and intelligibility scores. Second, we evaluate the effectiveness of Whisper in building a more robust speech assessment model. Third, we analyze whether combining representations from Whisper and SSL models benefits MOSA-Net+. The experimental results show that Whisper's embedding features correlate more strongly with subjective quality and intelligibility than those of the SSL models, which contributes to the more accurate predictions of MOSA-Net+. Moreover, combining the embedding features from Whisper and the SSL models yields only marginal improvement. Compared with MOSA-Net and other SSL-based speech assessment models, MOSA-Net+ yields notable improvements in estimating subjective quality and intelligibility scores across all evaluation metrics. We further tested MOSA-Net+ on Track 3 of the VoiceMOS Challenge 2023, where it obtained the top-ranked performance.