A crucial part of an accurate and reliable spoken language assessment system is the underlying automatic speech recognition (ASR) model. Recently, large-scale pre-trained ASR foundation models such as Whisper have become available. Because the output of these models is designed to be human readable, punctuation is added, numbers are written as Arabic numerals, and abbreviations are used. These models also tend to omit disfluencies and hesitations from the output. Though useful for readability, these attributes are unhelpful for assessing a candidate's ability and providing feedback, where a precise transcription of what the candidate said is needed. In this paper, we give a detailed analysis of Whisper outputs and propose two solutions: fine-tuning and soft prompt tuning. Experiments are conducted on both public speech corpora and an English learner dataset. Results show that we can effectively alter the decoding behaviour of Whisper to generate the exact words spoken in the response.
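To make the mismatch between a readable transcript and a verbatim one concrete, the following is a minimal sketch that counts word-level edits between the two. It is illustrative only and is not the paper's evaluation code: the utterance, the helper names `words` and `word_errors`, and the simple punctuation-stripping rule are all assumptions made for this example.

```python
import re

def words(text):
    """Lowercase, replace punctuation (except apostrophes) with spaces, and split."""
    return re.sub(r"[^\w\s']", " ", text.lower()).split()

def word_errors(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(
                prev[j] + 1,             # deletion of a reference word
                cur[j - 1] + 1,          # insertion of a hypothesis word
                prev[j - 1] + (r != h),  # substitution, or match at zero cost
            ))
        prev = cur
    return prev[-1]

# Hypothetical utterance (an assumption, not data from the paper).
verbatim = "um i i have lived here for twenty years"  # what was actually said
readable = "I have lived here for 20 years."          # readable-style ASR output

ref, hyp = words(verbatim), words(readable)
errors = word_errors(ref, hyp)
wer = errors / len(ref)
print(f"{errors} word errors over {len(ref)} reference words (WER = {wer:.2f})")
```

In this toy case the readable output drops the hesitation and the repeated word and rewrites the number as digits, so it is scored as three word errors against the verbatim reference even though it is a perfectly good readable transcript.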