RILEC: Detection and Generation of L1 Russian Interference Errors in English Learner Texts

Many errors in student essays can be explained by influence from the writer's native language (L1). Such L1 interference errors include, for example, writing "stadion" instead of "stadium", a lexical transliteration from Russian. In this work, we address the task of detecting these errors in English essays written by Russian-speaking learners. We introduce RILEC, a large-scale dataset of over 18,000 sentences that combines expert-annotated data from REALEC with synthetic examples generated through rule-based and neural augmentation. We propose a framework for generating L1-motivated errors using generative language models optimized with PPO, prompt-based control, and rule-based patterns. Models fine-tuned on RILEC achieve strong performance, particularly on word-level interference types such as transliteration and tense semantics. We find that the proposed augmentation pipeline leads to a significant performance improvement, making it a potentially valuable tool for helping learners and teachers identify and address such errors more effectively.
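To make the rule-based side of the augmentation concrete, the following is a minimal sketch of how a transliteration error could be injected into a clean sentence while recording its character span for detection training. It is not the authors' pipeline: the function name `inject_transliteration_errors`, the rule table, and the span format are all assumptions for illustration, and only the "stadium" to "stadion" pair comes from the abstract.

```python
import re

# Hypothetical rule table mapping correct English words to Russian-influenced
# spellings. Only the "stadium" -> "stadion" pair is taken from the abstract;
# a real rule set would be far larger.
TRANSLITERATION_RULES = {
    "stadium": "stadion",
}


def inject_transliteration_errors(sentence, rules=TRANSLITERATION_RULES):
    """Replace known words with their transliterated variants.

    Returns the corrupted sentence and a list of (start, end) character
    spans, measured in the corrupted sentence, marking each injected error.
    """
    pieces = []      # chunks of the output sentence
    spans = []       # character spans of injected errors in the output
    src_pos = 0      # how far we have copied from the original sentence
    out_len = 0      # current length of the output sentence

    for match in re.finditer(r"[A-Za-z]+", sentence):
        word = match.group(0)
        replacement = rules.get(word.lower())
        if replacement is None:
            continue
        # Preserve sentence-initial capitalisation of the original word.
        if word[0].isupper():
            replacement = replacement.capitalize()
        # Copy the untouched text before the match, then the replacement.
        untouched = sentence[src_pos:match.start()]
        pieces.append(untouched)
        out_len += len(untouched)
        pieces.append(replacement)
        spans.append((out_len, out_len + len(replacement)))
        out_len += len(replacement)
        src_pos = match.end()

    pieces.append(sentence[src_pos:])
    return "".join(pieces), spans


corrupted, spans = inject_transliteration_errors("We met near the stadium yesterday.")
print(corrupted)
for start, end in spans:
    print(corrupted[start:end])
```

Pairs of clean and corrupted sentences produced this way, together with the recorded spans, could serve as synthetic training examples for a token-level or span-level detector. The neural and PPO-based generation described in the abstract is not covered by this sketch.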

