The scale of speech anti-spoofing datasets has grown exponentially over the past decade, driven by the assumption that more data leads to better performance. However, it remains unclear whether indiscriminate scaling yields a commensurate improvement in model generalization. This study challenges the "scale-first" paradigm by decoupling the effect of training data scale from that of training data diversity. Through experiments on representative datasets, we report two key findings: (1) Larger is not always better. Excessively expanding the training data while the set of generation methods stays fixed yields negligible returns and may even degrade cross-domain generalization due to overfitting. (2) Diversity outweighs scale. A smaller composite training set featuring diverse attacks significantly outperforms larger datasets with limited diversity in cross-dataset evaluations. We conclude that future dataset construction should prioritize the diversity of generation methods over scale in order to improve model generalization.