Progress in neural grammatical error correction (GEC) is hindered by the lack of annotated training data. Sufficient amounts of high-quality manually annotated data are not available, so recent research has relied on generating synthetic data, pretraining on it, and then fine-tuning on real datasets; performance gains have been achieved either by ensembling or by using huge pretrained models such as T5-XXL as the backbone. In this work, we explore an orthogonal direction: how to use available data more efficiently. First, we propose auxiliary tasks that exploit the alignment between the original and corrected sentences, such as predicting a sequence of corrections. We formulate each task as a sequence-to-sequence problem and perform multi-task training. Second, we discover that the order of the datasets used for training, and even the order of individual instances within a dataset, may have important effects on the final performance, so we set out to find the best training schedule. Together, these two ideas lead to significant improvements, producing results that improve on the state of the art with much smaller models; in particular, we outperform the best models based on T5-XXL (11B parameters) with a BART-based model (400M parameters).
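To make the idea of an alignment-based auxiliary task concrete, the following is a minimal sketch of how a "sequence of corrections" target might be derived from a (source, corrected) sentence pair and cast as a sequence-to-sequence example alongside the main correction task. The abstract does not specify the alignment tool, the edit format, or the task prefixes, so everything here is an assumption made for illustration: the token alignment uses Python's standard `difflib`, and the function names `extract_edits` and `build_examples`, the `correct:` / `edits:` prefixes, and the `<ins>`, `<del>`, `<none>` markers are hypothetical.

```python
from difflib import SequenceMatcher


def extract_edits(source: str, target: str) -> list[tuple[str, str]]:
    """Return (original span, corrected span) pairs from a token-level alignment.

    Assumption: whitespace tokenization and difflib alignment stand in for
    whatever alignment procedure the paper actually uses.
    """
    src, tgt = source.split(), target.split()
    matcher = SequenceMatcher(a=src, b=tgt, autojunk=False)
    edits = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":  # keep only replaced, deleted, or inserted spans
            edits.append((" ".join(src[i1:i2]), " ".join(tgt[j1:j2])))
    return edits


def build_examples(source: str, target: str) -> list[tuple[str, str]]:
    """Build (input, output) text pairs for two tasks over the same sentence pair.

    The task prefixes and edit markers are hypothetical, chosen for illustration.
    """
    edits = extract_edits(source, target)
    edit_str = " ; ".join(
        f"{old or '<ins>'} -> {new or '<del>'}" for old, new in edits
    )
    return [
        ("correct: " + source, target),               # main GEC task
        ("edits: " + source, edit_str or "<none>"),   # auxiliary: predict corrections
    ]


if __name__ == "__main__":
    for inp, out in build_examples(
        "She go to school yesterday", "She went to school yesterday"
    ):
        print(f"{inp!r} => {out!r}")
```

Both examples share the same input sentence and differ only in prefix and target, so a single sequence-to-sequence model could in principle be trained on a mixture of them; how the tasks are mixed and ordered is exactly the training-schedule question the abstract raises and is not addressed by this sketch.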