Automatic Speech Recognition for Non-Native English: Accuracy and Disfluency Handling

Automatic speech recognition (ASR) has been an essential component of computer-assisted language learning (CALL) and computer-assisted language testing (CALT) for many years. As the technology continues to develop rapidly, it is important to evaluate how accurate current ASR systems are for language learning applications. This study assesses how well five cutting-edge ASR systems recognize non-native accented English, using recordings from the L2-ARCTIC corpus of both read and spontaneous speech by speakers from six L1 backgrounds (Arabic, Chinese, Hindi, Korean, Spanish, and Vietnamese). The read speech consisted of 2,400 single-sentence recordings from 24 speakers, and the spontaneous speech consisted of narrative recordings from 22 speakers. For read speech, Whisper and AssemblyAI achieved the best accuracy, with mean match error rates (MER) of 0.054 and 0.056 respectively, approaching human-level accuracy. For spontaneous speech, RevAI performed best, with a mean MER of 0.063. The study also examined how each system handled disfluencies such as filler words, repetitions, and revisions, and found significant variation in performance across systems and disfluency types. Processing speed varied considerably between systems, but longer processing times did not necessarily correspond to better accuracy. By detailing the performance of several of the most recent, widely available ASR systems on non-native English speech, this study aims to help language instructors and researchers understand the strengths and weaknesses of each system and identify which may be suitable for specific use cases.
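Because the accuracy figures above are reported as MER, a small worked example may help readers interpret them. MER is commonly defined as (S + D + I) / (H + S + D + I), where H, S, D, and I are the hits, substitutions, deletions, and insertions in a minimum-edit word alignment between a reference transcript and the ASR output. The sketch below is only an illustration under stated assumptions: the function name `match_error_rate` is hypothetical, and the text normalization (lowercasing and whitespace tokenization) is assumed here, since the abstract does not describe the study's actual scoring pipeline.

```python
def match_error_rate(reference: str, hypothesis: str) -> float:
    """Illustrative MER: (S + D + I) / (H + S + D + I) over a word alignment.

    Assumption: lowercasing and whitespace splitting stand in for whatever
    normalization a real evaluation would apply.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    n, m = len(ref), len(hyp)

    # dist[i][j] = minimum number of edits turning ref[:i] into hyp[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = i
    for j in range(1, m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j - 1] + cost,  # hit or substitution
                dist[i - 1][j] + 1,         # deletion (reference word missing)
                dist[i][j - 1] + 1,         # insertion (extra hypothesis word)
            )

    # Walk back through the table to count each operation type.
    hits = subs = dels = ins = 0
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            if dist[i][j] == dist[i - 1][j - 1] + cost:
                if cost == 0:
                    hits += 1
                else:
                    subs += 1
                i -= 1
                j -= 1
                continue
        if i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            dels += 1
            i -= 1
        else:
            ins += 1
            j -= 1

    total = hits + subs + dels + ins
    return (subs + dels + ins) / total if total else 0.0


if __name__ == "__main__":
    # One substitution (sat -> sit) and one deletion (the) against four hits.
    print(round(match_error_rate("the cat sat on the mat", "the cat sit on mat"), 3))
```

Under this definition, a mean MER of 0.054 means that roughly 5 of every 100 aligned word positions were errors. Note that a system that transcribes a filler word such as "um" when the reference omits it is charged an insertion, which is one reason disfluency handling can affect the reported scores.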
