Subjective speech quality assessment (SSQA) is critical for evaluating speech samples as perceived by human listeners. While model-based SSQA has enjoyed great success thanks to the development of deep neural networks (DNNs), generalization remains a key challenge, especially for unseen, out-of-domain data. To benchmark the generalization abilities of SSQA models, we present MOS-Bench, a diverse collection of datasets. In addition, we introduce SHEET, an open-source toolkit containing complete recipes for conducting SSQA experiments. We provide benchmark results for MOS-Bench and explore multi-dataset training to enhance generalization. We also propose a new performance metric, the best score difference/ratio, and use latent space visualizations to explain model behavior, offering valuable insights for future research.
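The abstract names the best score difference/ratio metric without defining it. As a minimal sketch only, under the assumption (not stated in the abstract) that the metric compares a model's per-dataset score against the best score attained on that dataset, it might look like:

```python
# Hypothetical sketch of a "best score difference / ratio" style metric.
# Assumption (not from the abstract): for each test set, a model's score
# is compared against the best score achieved on that set, then averaged.

def best_score_difference(scores, best_scores):
    """Mean gap between the per-dataset best score and the model's score."""
    assert len(scores) == len(best_scores) and scores
    return sum(b - s for s, b in zip(scores, best_scores)) / len(scores)

def best_score_ratio(scores, best_scores):
    """Mean fraction of the per-dataset best score the model attains."""
    assert len(scores) == len(best_scores) and scores
    return sum(s / b for s, b in zip(scores, best_scores)) / len(scores)

# Toy example: hypothetical system-level correlations on three test sets.
model_scores = [0.85, 0.70, 0.90]
best_scores = [0.90, 0.80, 0.92]
diff = best_score_difference(model_scores, best_scores)
ratio = best_score_ratio(model_scores, best_scores)
```

All names and numbers here are illustrative; the paper's actual definition should be consulted for the precise formulation.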