Automatic Mean Opinion Score (MOS) prediction is used to evaluate the quality of synthetic speech. This study extends predicted MOS to the task of Fake Audio Detection (FAD), on the expectation that MOS can indicate how close synthesized speech is to the natural human voice. We propose MOS-FAD, which leverages MOS at two key points in FAD: training data selection and model fusion. For training data selection, we demonstrate that MOS enables effective filtering of samples from unbalanced datasets. For model fusion, our results demonstrate that incorporating MOS as a gating mechanism enhances overall performance.
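The two uses of MOS named above can be illustrated with a minimal sketch. The abstract does not specify the selection rule or the form of the gate, so everything here is an assumption made for illustration: the MOS range used for filtering, the logistic gate and its `center` and `sharpness` parameters, the two-model setup, and all function names are hypothetical and are not the authors' method.

```python
import math


def select_by_mos(samples, low, high):
    """Keep samples whose predicted MOS lies in [low, high].

    Assumption: a simple range filter; the source does not state
    which MOS values are kept or discarded.
    """
    return [s for s in samples if low <= s["mos"] <= high]


def mos_gate(mos, center=3.0, sharpness=2.0):
    """Map a predicted MOS to a weight in (0, 1) with a logistic function.

    Assumption: `center` and `sharpness` are illustrative values only.
    """
    return 1.0 / (1.0 + math.exp(-sharpness * (mos - center)))


def gated_fusion(score_a, score_b, mos):
    """Blend two detector scores, weighting model A more as MOS rises.

    Assumption: a convex combination of two hypothetical FAD model scores.
    """
    g = mos_gate(mos)
    return g * score_a + (1.0 - g) * score_b


if __name__ == "__main__":
    # Hypothetical utterances with predicted MOS values.
    samples = [
        {"id": "utt_a", "mos": 2.1},
        {"id": "utt_b", "mos": 3.8},
        {"id": "utt_c", "mos": 4.6},
    ]
    kept = select_by_mos(samples, low=3.0, high=5.0)
    print([s["id"] for s in kept])

    # Fuse two hypothetical detector scores for an utterance with MOS 3.0.
    print(gated_fusion(score_a=0.9, score_b=0.1, mos=3.0))
```

In this sketch the gate is a soft switch: when the predicted MOS equals `center`, both detectors contribute equally, and the weight shifts smoothly toward one detector as MOS moves away from it. A learned gate could replace the fixed logistic function without changing the overall structure.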