Self-supervised learning (SSL) has transformed speech processing, yet its reliance on massive pre-training datasets remains a bottleneck. While robustness is often attributed to scale and diversity, the role of the data distribution is less well understood. We systematically examine how curated subsets of pre-training data influence Automatic Speech Recognition (ASR) performance. Surprisingly, optimizing for acoustic, speaker, or linguistic diversity yields no clear improvements over random sampling. Instead, we find that prioritizing the longest utterances achieves superior ASR results while using only half the original dataset, reducing pre-training time by 24% on a large corpus. These findings suggest that for pre-training speech SSL models, data length is a more critical factor than either data diversity or overall data quantity for performance and efficiency, offering a new perspective on data selection strategies in SSL speech processing.
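The selection strategy described above can be sketched as a simple greedy procedure: sort utterances by duration and keep the longest ones until half of the corpus (measured by total audio duration) is covered. This is a minimal illustration, not the paper's implementation; the `(id, duration)` utterance representation and the `select_longest` helper are assumptions made for the example.

```python
def select_longest(utterances, fraction=0.5):
    """Greedily keep the longest utterances until `fraction` of the
    corpus's total duration is covered.

    `utterances` is a list of (utterance_id, duration_seconds) pairs;
    this representation is hypothetical, chosen for illustration.
    """
    total = sum(dur for _, dur in utterances)
    budget = fraction * total
    selected, covered = [], 0.0
    # Longest-first: sort by duration, descending.
    for uid, dur in sorted(utterances, key=lambda x: x[1], reverse=True):
        if covered >= budget:
            break
        selected.append(uid)
        covered += dur
    return selected

# Toy corpus: 90 s of audio in total; a 50% budget keeps the two longest clips.
corpus = [("a", 30.0), ("b", 5.0), ("c", 20.0), ("d", 25.0), ("e", 10.0)]
print(select_longest(corpus))  # → ['a', 'd']
```

In practice the same cutoff could equally be expressed as a fixed utterance count or a duration threshold; the duration-budget form shown here makes the "half the original dataset" condition from the abstract explicit.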