Methods for automatically assessing speech quality in real-world environments are critical for developing robust human language technologies and assistive devices. Behavioral ratings provided by human raters (e.g., mean opinion scores; MOS) are considered the gold standard, but they are susceptible to variability between individual raters, cannot easily be generalized across corpora, and are labor-intensive to collect, thus limiting the acoustic challenges they can quantify. Here, we present a new, scalable method for automatically assessing speech quality: the self-supervised speech quality assessment (S3QA) model. First, we manipulated high-quality utterances from multiple speech corpora, using a wide range of acoustic challenges intended to emulate common sources of quality degradation in the real world: frequency filtering, reverberation, background noise, and digital compression. Second, we leveraged an existing, pre-trained speech foundation model, WavLM, to computationally derive a self-supervised training target that quantified speech degradation using the cosine distance between the clean and degraded versions of each utterance in the embedding space. Next, we trained a transformer-based model to predict these cosine distances, given only the degraded versions of the utterances. Finally, the trained model was evaluated on unseen test corpora of synthetic mixtures, NISQA, and VOiCES. We show that the S3QA model trained on this task accurately predicts degradation cosine distances across a wide range of challenging acoustic conditions and is aligned with behavioral ratings (MOS), speech technology performance (automatic speech recognition), and other important features of the held-out data (e.g., microphone distances). This model provides an automated, scalable method for assessing speech quality across a wide range of acoustic challenges.
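The self-supervised training target described above can be illustrated with a minimal sketch. Here, the WavLM embedding step is stood in for by random frame-level embedding matrices (the real pipeline would extract these from the pre-trained model); only the cosine-distance computation between mean-pooled clean and degraded embeddings reflects the target derivation described in the abstract, and the function names are hypothetical.

```python
import numpy as np

def cosine_distance(a, b):
    """Cosine distance (1 - cosine similarity) between two vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def degradation_target(clean_frames, degraded_frames):
    """Self-supervised target: cosine distance between the mean-pooled
    frame embeddings of the clean and degraded versions of an utterance.

    clean_frames, degraded_frames: (T, D) arrays of frame embeddings
    (in the paper's pipeline these would come from pre-trained WavLM).
    """
    return cosine_distance(clean_frames.mean(axis=0),
                           degraded_frames.mean(axis=0))

# Toy example with random stand-in "embeddings" (768-dim, WavLM-Base size).
rng = np.random.default_rng(0)
clean = rng.normal(size=(50, 768))
noise = rng.normal(size=(50, 768))
degraded = clean + 0.5 * noise  # mild additive degradation
print(degradation_target(clean, degraded))  # small but nonzero distance
```

A model trained to regress this scalar from the degraded signal alone then serves as a reference-free quality estimator at inference time.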