Using State-of-the-Art Speech Models to Evaluate Oral Reading Fluency in Ghana

This paper reports on three recent experiments that use large-scale speech models to evaluate the oral reading fluency (ORF) of students in Ghana. While ORF is a well-established measure of foundational literacy, assessing it typically requires one-on-one sessions between a student and a trained evaluator, a process that is time-consuming and costly. Automating the evaluation of ORF could support better literacy instruction, particularly in educational contexts where formative assessment is uncommon due to large class sizes and limited resources. To our knowledge, this research is among the first to examine the use of the most recent versions of large-scale speech models (Whisper V2, wav2vec2.0) for ORF assessment in the Global South. We find that Whisper V2 produces transcriptions of Ghanaian students reading aloud with a Word Error Rate (WER) of 13.5. This is close to the model's average WER on adult speech (12.8) and would have been considered state-of-the-art for children's speech transcription only a few years ago. We also find that when these transcriptions are used to produce fully automated ORF scores, the scores closely align with those generated by expert human graders, with a correlation coefficient of 0.96. Importantly, these results were achieved on a representative dataset (i.e., students with regional accents, recordings taken in actual classrooms), using a free and publicly available speech model out of the box (i.e., no fine-tuning). This suggests that using large-scale speech models to assess ORF may be feasible to implement and scale in lower-resource, linguistically diverse educational contexts.
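
To make the two headline metrics concrete, the following is a minimal sketch of how a Word Error Rate and a correlation between automated and human scores could be computed. It is an illustration under stated assumptions, not the authors' pipeline: the function names `word_error_rate` and `pearson_r` are hypothetical, text is normalized only by lowercasing and whitespace splitting, WER is expressed as a percentage, and the correlation is assumed to be Pearson's r (the abstract does not name the coefficient).

```python
import math


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER as a percentage: word-level edit distance / reference length.

    Assumption: normalization is just lowercasing and whitespace splitting;
    a real evaluation would likely also strip punctuation.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return 100.0 * prev[-1] / len(ref)


def pearson_r(xs, ys) -> float:
    """Pearson correlation between two equal-length score lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length lists with at least 2 items")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    # Toy data for illustration only; not taken from the study.
    passage = "the cat sat on the mat"
    transcript = "the cat sat on a mat"
    print(f"WER: {word_error_rate(passage, transcript):.1f}")

    automated_scores = [40.0, 55.0, 72.0, 90.0]
    human_scores = [42.0, 53.0, 75.0, 88.0]
    print(f"r: {pearson_r(automated_scores, human_scores):.3f}")
```

In this sketch, one substituted word in a six-word passage yields a WER of about 16.7, which shows how a figure such as 13.5 can be read as roughly one word error per seven or eight reference words, assuming the reported values are percentages.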
