Recent advancements in Automatic Speech Recognition (ASR) systems, exemplified by Whisper, have demonstrated the potential of these systems to approach human-level performance given sufficient data. However, this progress does not readily extend to ASR for children, owing to the limited availability of suitable child-specific databases and the distinct characteristics of children's speech. A recent study investigated leveraging the My Science Tutor (MyST) children's speech corpus to enhance Whisper's performance in recognizing children's speech, and demonstrated some improvement on a limited test set. This paper builds on these findings by enhancing the utility of the MyST dataset through more efficient data preprocessing. We reduce the Word Error Rate (WER) on the MyST test set from 13.93% to 9.11% with Whisper-Small and from 13.23% to 8.61% with Whisper-Medium, and show that this improvement generalizes to unseen datasets. We also highlight important challenges in improving children's ASR performance. The results show that Whisper can be integrated viably and efficiently for effective children's speech recognition.
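
For readers unfamiliar with the metric, the following is a minimal sketch of how WER is conventionally computed: the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. The function names `word_error_rate` and `relative_reduction` are hypothetical, the sample sentences are invented for illustration, and the whitespace tokenization is an assumption; the abstract does not state which text normalization or scoring tool was used.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Assumption: words are split on whitespace with no further normalization.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(before: float, after: float) -> float:
    """Relative improvement between two error rates (same units)."""
    return (before - after) / before


# Illustrative sentences (not taken from MyST): one deleted word out of six.
example_wer = word_error_rate("the cat sat on the mat", "the cat sat on mat")

# Relative WER reductions implied by the figures reported above.
small_gain = relative_reduction(13.93, 9.11)
medium_gain = relative_reduction(13.23, 8.61)
```

Under these assumptions, the reported figures correspond to a relative WER reduction of roughly 35% for both Whisper-Small and Whisper-Medium.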