Ungrammatical expressions and disfluencies are prevalent in the spontaneous speech of second language (L2) learners and pose unique challenges to Automatic Speech Recognition (ASR) systems. However, few datasets are tailored to L2 learner speech. We publicly release LearnerVoice, a dataset consisting of 50.04 hours of audio and transcriptions of L2 learners' spontaneous speech. Our linguistic analysis reveals that transcriptions in our dataset contain significantly more L2S (L2 learner's Spontaneous speech) features, namely ungrammatical expressions and disfluencies (e.g., filler words, word repetitions, self-repairs, false starts), than native speech datasets. Fine-tuning whisper-small.en on LearnerVoice achieves a word error rate (WER) of 10.26%, which is 44.2% lower than that of vanilla whisper-small.en. Furthermore, our qualitative analysis indicates that 54.2% of the vanilla model's errors on LearnerVoice are attributable to L2S features, 48.1% of which are reduced in the fine-tuned model.
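For readers unfamiliar with the metric, WER is conventionally computed as the word-level edit distance between a reference transcript and the ASR hypothesis, divided by the number of reference words. The sketch below illustrates that conventional definition only; it is not the evaluation code used for LearnerVoice, and the function names `word_error_rate` and `relative_reduction`, the whitespace tokenization, and the example sentences are assumptions made for illustration. The second helper shows one reading of "44.2% lower", namely a relative reduction with respect to the baseline WER.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumption: tokens are obtained by whitespace splitting, with no text
    normalization (casing, punctuation) applied beforehand.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(
                min(
                    prev[j] + 1,         # deletion: reference word missing
                    curr[j - 1] + 1,     # insertion: extra hypothesis word
                    prev[j - 1] + cost,  # substitution or exact match
                )
            )
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline: float, improved: float) -> float:
    """Fraction by which `improved` is lower than `baseline`."""
    return (baseline - improved) / baseline


# Hypothetical example: a verbatim reference with word repetitions, and a
# hypothesis in which the recognizer silently dropped the repeated words.
reference = "i i go to to school yesterday"
hypothesis = "i go to school yesterday"
wer = word_error_rate(reference, hypothesis)  # 2 deletions over 7 words
```

In this made-up example the hypothesis reads more fluently than the reference, yet it is penalized with two deletions because the reference preserves the speaker's repetitions. This is the kind of error that disfluencies can induce when transcripts keep them verbatim.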