Despite recent advances in deep learning, child speech recognition remains a challenging task. Current Automatic Speech Recognition (ASR) models require substantial amounts of annotated training data, and such data is scarce for child speech. In this work, we explore the ASR model wav2vec2 with different pretraining and finetuning configurations for self-supervised learning (SSL), with the aim of improving automatic child speech recognition. The pretrained wav2vec2 models were finetuned on different amounts of child speech data, adult speech data, and a combination of both, to determine how much data is needed to finetune the model for child ASR. Our best models achieve a Word Error Rate (WER) of 7.42 on the MyST child speech dataset, 2.99 on the PFSTAR dataset, and 12.47 on the CMU KIDS dataset, lower than any previously reported method. With only 10 hours of child speech data in finetuning, our models outperform wav2vec2 BASE 960, which is considered a state-of-the-art ASR model for adult speech, when evaluated on child speech. We also analyze how different types of training data affect inference, using combinations of datasets in pretraining, finetuning, and inference.
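For readers unfamiliar with the metric reported above, the following is a minimal sketch of how Word Error Rate is conventionally computed: the word-level edit distance (substitutions, deletions, insertions) between a reference transcript and a hypothesis, divided by the number of reference words. The function name `word_error_rate` and the whitespace tokenization are illustrative assumptions, not the evaluation code used in this work; the reported scores are presumably this ratio expressed as a percentage.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Illustrative helper (hypothetical name); tokenization is a plain
    whitespace split, which real evaluation pipelines usually refine
    with text normalization.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr

    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution ("sat" -> "sit") and one deletion ("the"): 2 errors / 6 words.
    wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
    print(f"WER: {100 * wer:.2f}%")
```

Because insertions count as errors, this ratio can exceed 1.0 when the hypothesis is much longer than the reference.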