Many errors in student essays can be traced to the influence of the writer's native language (L1). Such L1 interference appears, for example, when a learner writes "stadion" instead of "stadium", a lexical transliteration from Russian. In this work, we address the task of detecting such errors in English essays written by Russian-speaking learners. We introduce RILEC, a large-scale dataset of over 18,000 sentences that combines expert-annotated data from REALEC with synthetic examples generated through rule-based and neural augmentation. We propose a framework for generating L1-motivated errors using generative language models optimized with PPO, prompt-based control, and rule-based patterns. Models fine-tuned on RILEC achieve strong performance, particularly on word-level interference types such as transliteration and tense semantics. We find that the proposed augmentation pipeline significantly improves detection performance, making the approach a potentially valuable tool for helping learners and teachers identify and address such errors.