We introduce NaSGEC, a new dataset to facilitate research on Chinese grammatical error correction (CGEC) for native speaker texts from multiple domains. Previous CGEC research has primarily focused on correcting texts from a single domain, especially learner essays. To broaden the target domain, we annotate multiple references for 12,500 sentences from three native domains: social media, scientific writing, and examination. We provide solid benchmark results for NaSGEC using cutting-edge CGEC models and different training data. We further analyze in detail the connections and gaps between our domains from both empirical and statistical perspectives. We hope this work can inspire future studies on an important but under-explored direction: cross-domain GEC.