A crucial component of an accurate and reliable spoken language assessment system is the underlying ASR model. Recently, large-scale pre-trained ASR foundation models such as Whisper have become available. Because the output of these models is designed to be human readable, punctuation is added, numbers are written as Arabic numerals, and abbreviations are included. These models also tend to omit disfluencies and hesitations from the output. Although these properties aid readability, they are unhelpful for assessing a candidate's ability and providing feedback, where a precise transcription of what the candidate said is needed. In this paper, we give a detailed analysis of Whisper outputs and propose two solutions: fine-tuning and soft prompt tuning. Experiments are conducted on both public speech corpora and an English learner dataset. Results show that we can effectively alter the decoding behaviour of Whisper so that it generates the exact words spoken in the response.
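To make the mismatch between readable and verbatim transcripts concrete, the following is a minimal sketch that scores a readable-style hypothesis against a verbatim reference using word error rate (WER). The two example strings are invented for illustration and are not taken from the paper's data, and the choice of WER as the metric, the `normalise` helper, and the function names are assumptions rather than the authors' evaluation setup.

```python
import re


def normalise(text):
    """Lowercase and drop punctuation (apostrophes kept), then split into words."""
    return re.sub(r"[^\w\s']", "", text.lower()).split()


def word_edit_distance(ref, hyp):
    """Levenshtein distance between two word lists (unit cost per edit)."""
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate of the hypothesis with respect to the reference."""
    ref = normalise(reference)
    hyp = normalise(hypothesis)
    if not ref:
        raise ValueError("reference must contain at least one word")
    return word_edit_distance(ref, hyp) / len(ref)


# Hypothetical example: what a learner said vs. a readable-style transcript.
verbatim = "um i have i have twenty two books"
readable = "I have 22 books."

print(f"WER against verbatim reference: {wer(verbatim, readable):.3f}")
```

Even after case and punctuation are normalised, the dropped hesitation, the dropped repetition, and the numeral still count as errors against the verbatim reference, which is the behaviour the proposed fine-tuning and soft prompt tuning aim to change.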