Data augmentation through pseudo-data generation has proven effective in mitigating data scarcity in Grammatical Error Correction (GEC). Various augmentation strategies have been widely explored, most of which are motivated by two heuristics: increasing the distribution similarity and the diversity of pseudo data. However, the mechanism underlying the effectiveness of these strategies remains poorly understood. In this paper, we aim to clarify how data augmentation improves GEC models. To this end, we introduce two interpretable and computationally efficient measures: Affinity and Diversity. Our findings indicate that a GEC data augmentation strategy with high Affinity and appropriate Diversity improves GEC model performance more effectively. Based on this observation, we propose MixEdit, a data augmentation approach that strategically and dynamically augments realistic data without requiring extra monolingual corpora. To verify our findings and the effectiveness of MixEdit, we conduct experiments on mainstream English and Chinese GEC datasets. The results show that MixEdit substantially improves GEC models and is complementary to traditional data augmentation methods.