Ungrammatical expressions and disfluencies are prevalent in the spontaneous speech of second language (L2) learners and pose unique challenges to Automatic Speech Recognition (ASR) systems. However, few datasets are tailored to L2 learner speech. We publicly release LearnerVoice, a dataset consisting of 50.04 hours of audio and transcriptions of L2 learners' spontaneous speech. Our linguistic analysis reveals that transcriptions in our dataset contain L2S (L2 learners' Spontaneous speech) features, consisting of ungrammatical expressions and disfluencies (e.g., filler words, word repetitions, self-repairs, false starts), significantly more often than transcriptions in native speech datasets. Fine-tuning whisper-small.en on LearnerVoice achieves a word error rate (WER) of 10.26%, 44.2% lower than that of vanilla whisper-small.en. Furthermore, our qualitative analysis indicates that 54.2% of the vanilla model's errors on LearnerVoice are attributable to L2S features, with the fine-tuned model reducing these errors by 48.1%.
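For readers unfamiliar with the metric, the following is a minimal sketch of how a word error rate and a relative reduction between two WER values are commonly computed. It is not the authors' evaluation code: the function names `word_error_rate` and `relative_reduction` are hypothetical, no text normalization is applied, and reading the reported 44.2% as a relative reduction is an assumption that the abstract does not state explicitly.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Illustrative only: splits on whitespace and applies no normalization
    (casing, punctuation), which real evaluations usually do.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline: float, improved: float) -> float:
    """Fraction by which `improved` is lower than `baseline`."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (baseline - improved) / baseline


if __name__ == "__main__":
    # Toy example (not from the dataset): a filler word and a repetition
    # in the reference are dropped by the recognizer.
    reference = "i um went to to school"
    hypothesis = "i went to school"
    print(word_error_rate(reference, hypothesis))
    # Toy WER values, chosen only to illustrate the calculation.
    print(relative_reduction(0.20, 0.10))
```

The toy pair illustrates why L2S features matter for this metric: a model that silently drops fillers and repetitions is penalized with deletions when the reference transcription keeps them verbatim.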