Recent work on Chinese text error correction has focused largely on Chinese Spelling Check (CSC) and Chinese Grammatical Error Diagnosis (CGED). In contrast, little attention has been paid to the complicated problem of Chinese Semantic Error Diagnosis (CSED), for which relevant datasets are lacking. Semantic errors are worth studying because they are very common and can lead to syntactic irregularities or even comprehension problems. To investigate them, we build the CSED corpus, which comprises two datasets: one for the CSED-Recognition (CSED-R) task and one for the CSED-Correction (CSED-C) task. Quality assurance mechanisms in our annotation process ensure high-quality data. Our experiments show that powerful pre-trained models perform poorly on this corpus, and that the task is challenging: even humans receive a low score. We propose syntax-aware models specifically adapted to the CSED task. The experimental results show that introducing the syntax-aware approach is meaningful.