We introduce Dataset Concealment (DSC), a rigorous new procedure for evaluating and interpreting objective speech quality estimation models. DSC quantifies and decomposes the performance gap between research results and real-world application requirements, while offering context and additional insight into model behavior and dataset characteristics. We also show the benefits of addressing the corpus effect by using the dataset Aligner from AlignNet when training models on multiple datasets. We demonstrate DSC and the improvements from the Aligner using nine training datasets and nine unseen datasets with three well-studied models: MOSNet, NISQA, and a Wav2Vec2.0-based model. DSC provides interpretable views of a model's generalization capabilities and limitations while allowing all available data to be used during training. An additional result is that adding the 1000-parameter dataset Aligner to the 94-million-parameter Wav2Vec model during training significantly improves the resulting model's ability to estimate speech quality for unseen data.