In online conferencing applications, estimating the perceived quality of an audio signal is crucial for ensuring a high quality of experience for the end user. The most reliable way to assess the quality of a speech signal is through human judgments, summarized by the mean opinion score (MOS). However, this approach is labor intensive and not feasible for large-scale applications. The focus has therefore shifted towards automated speech quality assessment through end-to-end training of deep neural networks. Recently, it was shown that leveraging pre-trained wav2vec-based XLS-R embeddings leads to state-of-the-art performance on the task of speech quality prediction. In this paper, we perform an in-depth analysis of the pre-trained model. First, we analyze the performance of embeddings extracted from each layer of XLS-R for each model size (300M, 1B, and 2B parameters). Surprisingly, we find two optimal regions for feature extraction: one in the low-level features and one in the high-level features. Next, we investigate the reason for the two distinct optima. We hypothesize that the low-level features capture characteristics of noise and room acoustics, whereas the high-level features focus on speech content and intelligibility. To test this hypothesis, we analyze the sensitivity of the MOS predictions to different levels of corruption in each category. We then fuse the features from the two optimal depths to determine whether they contain complementary information for MOS prediction. Finally, we compare the performance of the proposed models and assess their generalizability on unseen datasets.
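As a rough illustration of the fusion step, the sketch below pools frame-level embeddings from one low layer and one high layer and concatenates them into a single feature vector. It is a minimal sketch under stated assumptions, not the method used in the paper: mean pooling over time, fusion by concatenation, the nested-list layout of `hidden_states`, and the function names `mean_pool` and `fuse_layers` are all hypothetical choices made for illustration.

```python
def mean_pool(frames):
    """Average frame-level embeddings (a T x D nested list) into one D-dim vector.

    Assumption: simple mean pooling over time; the source does not say how
    frame-level features are aggregated.
    """
    if not frames:
        raise ValueError("need at least one frame")
    dim = len(frames[0])
    return [sum(frame[d] for frame in frames) / len(frames) for d in range(dim)]


def fuse_layers(hidden_states, low_layer, high_layer):
    """Concatenate pooled embeddings taken from two layer indices.

    Assumption: hidden_states[i] holds the T x D output of layer i, and
    fusion is plain concatenation (one simple option among several).
    """
    low = mean_pool(hidden_states[low_layer])
    high = mean_pool(hidden_states[high_layer])
    return low + high  # list concatenation, giving a 2*D-dim vector


# Toy input: 3 layers, 2 frames each, 2-dim embeddings (made-up values).
toy_hidden_states = [
    [[1.0, 3.0], [3.0, 5.0]],
    [[0.0, 0.0], [2.0, 2.0]],
    [[10.0, 20.0], [30.0, 40.0]],
]
fused = fuse_layers(toy_hidden_states, low_layer=0, high_layer=2)
```

In practice the per-layer embeddings would come from the pre-trained model, and the fused vector would feed a downstream MOS regressor; both are outside the scope of this sketch.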