Accurate assessment of L2 English pronunciation is crucial for language learning, as it provides personalized feedback and ensures fair evaluation of individual progress. However, automated scoring remains challenging due to the complexity of sentence-level fluency, prosody, and completeness. This paper evaluates the zero-shot performance of Qwen2-Audio-7B-Instruct, an instruction-tuned speech-LLM, on 5,000 Speechocean762 utterances. The model generates rubric-aligned scores for accuracy, fluency, prosody, and completeness, showing strong agreement with human ratings within a ±2 tolerance, especially for high-quality speech. However, it tends to overpredict scores for low-quality speech and lacks precision in error detection. These findings demonstrate the strong potential of speech-LLMs for scalable pronunciation assessment and suggest future improvements through enhanced prompting, calibration, and phonetic integration to advance Computer-Assisted Pronunciation Training.