Grammatical feedback is crucial for L2 learners, teachers, and testers. Spoken grammatical error correction (GEC) aims to give L2 learners feedback on their use of grammar when speaking. This process usually relies on a cascaded pipeline comprising an automatic speech recognition (ASR) system, disfluency removal, and GEC, which raises the concern of errors propagating between the individual modules. In this paper, we introduce an alternative "end-to-end" approach to spoken GEC that exploits a speech recognition foundation model, Whisper. This foundation model can replace the whole pipeline or part of it, e.g., ASR and disfluency removal. These end-to-end approaches are compared with more standard cascaded approaches on data from a free-speaking spoken language assessment test, Linguaskill. Results demonstrate that end-to-end spoken GEC is possible within this architecture, but the lack of available data limits current performance compared to a system using large quantities of text-based GEC data. Conversely, end-to-end disfluency detection and removal, which is easier for the attention-based Whisper to learn, does outperform cascaded approaches. Additionally, the paper discusses the challenges of providing feedback to candidates when using end-to-end systems for spoken GEC.
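The error-propagation concern in a cascaded pipeline can be illustrated with a minimal sketch. Everything here is a toy assumption rather than the paper's system: the filler list, the rule-based `remove_disfluencies` and `toy_gec` functions, and the two stub recognisers `perfect_asr` and `noisy_asr` are hypothetical stand-ins for trained models. The sketch only shows how a recognition error made upstream can leave the downstream GEC stage unable to produce the intended correction.

```python
from typing import Callable, List

# Hypothetical filler words; a real disfluency detector would be a trained model.
FILLERS = {"um", "uh", "er"}


def remove_disfluencies(tokens: List[str]) -> List[str]:
    """Toy stand-in: drop fillers and immediate word repetitions."""
    fluent: List[str] = []
    for tok in tokens:
        if tok in FILLERS:
            continue
        if fluent and fluent[-1] == tok:
            continue
        fluent.append(tok)
    return fluent


def toy_gec(tokens: List[str]) -> List[str]:
    """Toy stand-in for GEC: fixes a single hard-coded agreement error."""
    out = list(tokens)
    for i in range(len(out) - 1):
        if out[i] == "he" and out[i + 1] == "go":
            out[i + 1] = "goes"
    return out


def cascade(asr: Callable[[str], List[str]], audio_id: str) -> List[str]:
    """Cascaded pipeline: ASR, then disfluency removal, then GEC."""
    return toy_gec(remove_disfluencies(asr(audio_id)))


def perfect_asr(audio_id: str) -> List[str]:
    # Assumed error-free transcript of a disfluent, ungrammatical utterance.
    return "um he he go to school".split()


def noisy_asr(audio_id: str) -> List[str]:
    # Same utterance with one assumed recognition error ("go" heard as "goat").
    return "um he he goat to school".split()


if __name__ == "__main__":
    print(cascade(perfect_asr, "utt1"))
    print(cascade(noisy_asr, "utt1"))
```

With the error-free transcript the grammatical error reaches the GEC stage intact and is corrected; with the misrecognised word the GEC rule never fires, so the ASR error survives to the output. An end-to-end model, as described above, would map audio to the corrected text in a single step instead of passing intermediate transcripts between modules.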