Data augmentation with synthetic data is widely used in Grammatical Error Correction (GEC) to alleviate data scarcity. However, because of its inconsistent error distribution and noisy labels, synthetic data is mainly used in the pre-training phase rather than in the data-limited fine-tuning phase. In this paper, we propose a synthetic data construction method based on contextual augmentation, which augments the original data efficiently while keeping the error distribution more consistent with it. Specifically, we combine rule-based substitution with model-based generation, using a generative model to produce richer contexts for the extracted error patterns. We also propose a relabeling-based data cleaning method to mitigate the effect of noisy labels in the synthetic data. Experiments on CoNLL14 and BEA19-Test show that the proposed augmentation method consistently and substantially outperforms strong baselines and achieves state-of-the-art results with only a small amount of synthetic data.
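To make the pipeline described in the abstract more concrete, the following is a minimal Python sketch of its three ingredients: extracting error patterns from a parallel sentence pair, injecting a pattern into a new context by rule-based substitution, and relabeling synthetic pairs. It is an illustration under stated assumptions, not the authors' implementation. The function names (`extract_error_patterns`, `inject_error`, `relabel`) are hypothetical, the token alignment via `difflib` is a stand-in for whatever alignment tool the paper uses, the clean sentence passed to `inject_error` stands in for a context produced by a generative model, and `corrector` is a hypothetical callable representing a trained GEC model.

```python
from difflib import SequenceMatcher


def extract_error_patterns(source, target):
    """Return (erroneous, corrected) token-span pairs from one sentence pair.

    Assumption: whitespace tokenization and difflib alignment, used here
    only as a simple stand-in for a proper GEC alignment tool.
    """
    src, tgt = source.split(), target.split()
    matcher = SequenceMatcher(a=src, b=tgt, autojunk=False)
    patterns = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "replace":  # keep substitutions; skip insertions/deletions
            patterns.append((" ".join(src[i1:i2]), " ".join(tgt[j1:j2])))
    return patterns


def inject_error(clean_sentence, pattern):
    """Rule-based substitution: swap the corrected span for the erroneous one.

    `clean_sentence` stands in for a context written by a generative model.
    Returns (noisy_source, clean_target), or None if the corrected span
    does not occur in the sentence.
    """
    erroneous, corrected = pattern
    tokens, span = clean_sentence.split(), corrected.split()
    for i in range(len(tokens) - len(span) + 1):
        if tokens[i:i + len(span)] == span:
            noisy = tokens[:i] + erroneous.split() + tokens[i + len(span):]
            return " ".join(noisy), clean_sentence
    return None


def relabel(pairs, corrector):
    """One plausible reading of relabeling-based cleaning (an assumption):
    replace each synthetic target with the prediction of a GEC model."""
    return [(src, corrector(src)) for src, _ in pairs]


if __name__ == "__main__":
    patterns = extract_error_patterns(
        "She go to school every day .", "She goes to school every day ."
    )
    print(patterns)
    print(inject_error("He goes to work by bus .", patterns[0]))
```

Because the injected errors are drawn from patterns observed in the original data, the synthetic pairs inherit its error types, which is the intuition behind the "more consistent error distribution" claim; how the paper actually selects patterns and generates contexts is not specified in the abstract.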