Speech quality estimation has recently undergone a paradigm shift from designs based on human-hearing expertise to machine-learning models. However, current models rely mainly on supervised learning, for which label collection is time-consuming and expensive. To address this problem, we propose VQScore, a self-supervised metric for evaluating speech quality based on the quantization error of a vector-quantized variational autoencoder (VQ-VAE). The VQ-VAE is trained on clean speech only; hence, large quantization errors can be expected when the input speech is distorted. To further improve the correlation with real quality scores, domain knowledge of speech processing is incorporated into the model design. We found that the vector quantization mechanism can also be used for self-supervised speech enhancement (SE) model training. To improve the robustness of the encoder for SE, a novel self-distillation mechanism combined with adversarial training is introduced. In summary, the proposed speech quality estimation method and enhancement model require only clean speech for training, without any labels. Experimental results show that the proposed VQScore and enhancement model are competitive with supervised baselines. The code will be released after publication.
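The core idea, that frames of distorted speech land farther from a codebook learned on clean speech, can be illustrated with a minimal toy sketch. This is not the authors' implementation (which uses a trained VQ-VAE encoder and incorporates speech-processing domain knowledge); it only shows, on synthetic vectors, how quantization distance against a clean codebook separates clean from distorted inputs:

```python
import numpy as np

def vqscore_sketch(frames, codebook):
    """Toy quality proxy: quantize each frame to its nearest codebook
    vector and score by the mean cosine similarity between frames and
    their quantized versions (higher = closer to the clean codebook)."""
    # Pairwise squared distances between frames and code vectors.
    d = ((frames[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    quantized = codebook[d.argmin(axis=1)]  # nearest code per frame
    cos = (frames * quantized).sum(-1) / (
        np.linalg.norm(frames, axis=-1) * np.linalg.norm(quantized, axis=-1) + 1e-8
    )
    return cos.mean()

rng = np.random.default_rng(0)
codebook = rng.normal(size=(64, 16))         # stand-in "clean speech" codebook
clean = codebook[rng.integers(0, 64, 100)]   # frames lying exactly on the codebook
noisy = clean + rng.normal(scale=2.0, size=clean.shape)  # distorted frames

# Clean frames quantize with near-zero error; distorted frames do not,
# so the distorted score is lower.
print(vqscore_sketch(clean, codebook) > vqscore_sketch(noisy, codebook))
```

In the paper's setting the frames would be encoder outputs of a real VQ-VAE rather than raw random vectors, but the ordering property demonstrated here is exactly what VQScore exploits.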