Claims about the robustness and fairness of deepfake speech detectors are only as credible as the datasets used to train and evaluate those systems. We present a dataset-level audit of the deepfake speech landscape. We compile and analyze 39 deepfake speech datasets, examining key attributes including accessibility, documentation, demographic and language coverage, dataset scale, and the underlying bona fide speech sources. Our audit yields two main findings. First, fairness assessment is largely infeasible: most datasets lack demographic metadata, and only a few contain gender or language labels. This prevents meaningful subgroup analysis and leaves other demographic attributes unaddressed. Second, we identify substantial overlap in the underlying bona fide source corpora across datasets, which can undermine cross-dataset evaluation and lead to overstated generalization claims.