Self-supervised speech recognition models require considerable labeled training data to learn high-fidelity representations for Automatic Speech Recognition (ASR), a process that is computationally demanding and time-consuming. We consider the task of identifying an optimal subset of data for efficient fine-tuning of self-supervised speech models for ASR. We find that the dataset pruning strategies used in vision tasks to sample the most informative examples do not perform better than random subset selection when fine-tuning self-supervised ASR models. We then present the COWERAGE algorithm for representative subset selection in self-supervised ASR. COWERAGE is based on our finding that ensuring coverage of examples across training Word Error Rate (WER) values in the early training epochs leads to better generalization performance. Extensive experiments with the wav2vec 2.0 and HuBERT models on the TIMIT, Librispeech, and LJSpeech datasets show the effectiveness of COWERAGE and its transferability across models, with up to 17% relative WER improvement over existing dataset pruning methods and random sampling. We also demonstrate that coverage of training instances in terms of WER values ensures the inclusion of phonemically diverse examples, leading to better test accuracy in self-supervised speech recognition models.
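The abstract does not spell out the selection procedure, so the following is only a minimal sketch of the coverage idea, assuming that per-utterance training WER values from an early epoch are already available. The function name `wer_coverage_subset`, the equal-size bucketing of the WER-ranked list, and the `fraction`, `num_buckets`, and `seed` parameters are hypothetical choices made for illustration, not the authors' implementation.

```python
import random
from typing import Dict, List


def wer_coverage_subset(
    wer_by_id: Dict[str, float],
    fraction: float,
    num_buckets: int = 10,
    seed: int = 0,
) -> List[str]:
    """Pick a subset whose training-WER values span the observed range.

    Assumption: examples are ranked by early-epoch training WER, split into
    contiguous buckets of near-equal size, and sampled uniformly per bucket.
    """
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    if not wer_by_id:
        return []

    rng = random.Random(seed)
    # Rank utterances by WER; break ties by id so the result is deterministic.
    ranked = sorted(wer_by_id, key=lambda utt_id: (wer_by_id[utt_id], utt_id))
    n = len(ranked)
    # Never use more buckets than examples, so every bucket is non-empty.
    num_buckets = max(1, min(num_buckets, n))

    selected: List[str] = []
    for b in range(num_buckets):
        start = b * n // num_buckets
        end = (b + 1) * n // num_buckets
        bucket = ranked[start:end]
        # Take at least one example per bucket so no WER range is left out.
        k = max(1, round(len(bucket) * fraction))
        selected.extend(rng.sample(bucket, k))
    return selected


if __name__ == "__main__":
    # Toy data: 100 utterances with WER values 0.00, 0.01, ..., 0.99.
    toy_wer = {f"utt{i:03d}": i / 100 for i in range(100)}
    subset = wer_coverage_subset(toy_wer, fraction=0.3)
    print(len(subset), "utterances selected")
```

Sampling within each bucket, rather than keeping only the hardest or easiest examples, is what distinguishes this kind of coverage-based selection from pruning by a single score threshold. In practice the WER values would come from decoding the training set with the partially fine-tuned model.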