Grammatical Error Correction (GEC) aims to automatically correct grammatical errors in natural text. Almost all previous work treats annotated training data uniformly and neglects the inherent discrepancies among training examples. In this paper, we consider two aspects of these discrepancies: the accuracy of data annotation and the diversity of potential annotations. To account for them, we propose MainGEC, which designs token-level training weights based on discrepancies in annotation accuracy and sentence-level training weights based on discrepancies in the potential diversity of annotations, and then conducts mixed-grained weighted training to improve GEC training. Empirical evaluation shows that MainGEC achieves consistent and significant performance improvements on two benchmark datasets in both the Seq2Seq and Seq2Edit settings, demonstrating the effectiveness and superiority of mixed-grained weighted training. Further ablation experiments verify the effectiveness of the designed weights at both granularities.
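To make the idea of mixed-grained weighted training concrete, the following is a minimal sketch of a loss that combines per-token and per-sentence weights. It is an illustration under stated assumptions, not the MainGEC implementation: the function name `mixed_grained_weighted_nll`, the choice to multiply the two weights, and the normalization by token count are all hypothetical, and the weights are taken as given inputs because the abstract does not specify how they are computed.

```python
import math
from typing import Sequence


def mixed_grained_weighted_nll(
    token_probs: Sequence[Sequence[float]],
    token_weights: Sequence[Sequence[float]],
    sentence_weights: Sequence[float],
) -> float:
    """Weighted negative log-likelihood over a batch of sentences.

    token_probs[i][j]   : model probability of the j-th reference token of sentence i
    token_weights[i][j] : token-level weight (assumed to reflect annotation accuracy)
    sentence_weights[i] : sentence-level weight (assumed to reflect annotation diversity)

    Assumption: the two granularities are combined by multiplication and the
    result is averaged over all tokens. The source does not state this.
    """
    if not (len(token_probs) == len(token_weights) == len(sentence_weights)):
        raise ValueError("all inputs must cover the same number of sentences")

    total = 0.0
    n_tokens = 0
    for probs, t_weights, s_weight in zip(token_probs, token_weights, sentence_weights):
        if len(probs) != len(t_weights):
            raise ValueError("each sentence needs one weight per token")
        for p, w in zip(probs, t_weights):
            # Each token's loss is scaled by its own weight and its sentence's weight.
            total += s_weight * w * -math.log(p)
        n_tokens += len(probs)

    # Average over tokens; an empty batch contributes zero loss.
    return total / n_tokens if n_tokens else 0.0
```

With all weights set to 1.0 the function reduces to the ordinary token-averaged negative log-likelihood, which corresponds to the uniform treatment of training data that the abstract argues against. Setting a token weight below 1.0 reduces the influence of a target token whose annotation is considered less reliable.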