Automatic Mean Opinion Score (MOS) prediction is crucial for evaluating the perceptual quality of synthetic speech. While recent approaches using pre-trained self-supervised learning (SSL) models have shown promising results, they address the data scarcity issue only in part, for the feature extractor. The data scarcity issue for the decoder remains unresolved, leading to suboptimal performance. To address this challenge, we propose a retrieval-augmented MOS prediction method, dubbed {\bf RAMP}, which makes the decoder more robust to data scarcity. We also propose a fusing network that dynamically adjusts the retrieval scope and the fusion weights for each instance based on the predictive confidence. Experimental results show that our proposed method outperforms existing methods in multiple scenarios.
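To make the idea concrete, the following is a minimal sketch of retrieval-augmented score prediction with confidence-based fusion. It is not the RAMP implementation: the abstract does not specify the datastore, the distance metric, or the form of the fusing network, so everything here is an assumption made for illustration. The function names (`retrieve_mos`, `fuse`), the Euclidean distance, the softmax weighting, the linear mapping from confidence to the number of neighbors, and the parameters `k_min`, `k_max`, and `temperature` are all hypothetical.

```python
import numpy as np


def retrieve_mos(query, keys, scores, k, temperature=1.0):
    """Distance-weighted average of the MOS labels of the k nearest stored embeddings.

    Assumption: Euclidean distance and softmax weighting; the paper may differ.
    """
    dists = np.linalg.norm(keys - query, axis=1)
    nearest = np.argsort(dists)[:k]
    logits = -dists[nearest] / temperature
    weights = np.exp(logits - logits.max())  # subtract max for numerical stability
    weights /= weights.sum()
    return float(weights @ scores[nearest])


def fuse(decoder_mos, confidence, query, keys, scores, k_min=1, k_max=8):
    """Blend the decoder's prediction with a retrieved estimate.

    Assumption: a hand-written rule stands in for the learned fusing network.
    Lower confidence widens the retrieval scope (larger k) and shifts weight
    toward the retrieved estimate.
    """
    confidence = float(np.clip(confidence, 0.0, 1.0))
    k = int(round(k_max - confidence * (k_max - k_min)))
    k = max(1, min(k, len(keys)))  # never request more neighbors than are stored
    retrieved = retrieve_mos(query, keys, scores, k)
    return confidence * decoder_mos + (1.0 - confidence) * retrieved


# Toy datastore: three stored utterance embeddings with their MOS labels.
keys = np.array([[0.0, 0.0], [1.0, 0.0], [10.0, 10.0]])
scores = np.array([2.0, 4.0, 5.0])
query = np.array([0.0, 0.0])

confident = fuse(3.0, 1.0, query, keys, scores)  # relies on the decoder alone
uncertain = fuse(3.0, 0.0, query, keys, scores)  # relies on retrieval alone
```

In this sketch a fully confident decoder prediction passes through unchanged, while a fully uncertain one is replaced by the retrieved estimate. A learned fusing network would presumably produce both the retrieval scope and the fusion weight from data rather than from a fixed rule.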