Chinese Text Error Correction (CTEC) aims to detect and correct errors in input text, which benefits people's daily lives and various downstream tasks. Recent approaches mainly employ Pre-trained Language Models (PLMs) to tackle the CTEC task and have achieved considerable success. However, previous approaches suffer from both over-correction and under-correction, and the former is especially conspicuous in the precision-critical CTEC task. To mitigate over-correction, we propose ProTEC, a novel model-agnostic progressive multi-task learning framework for CTEC that guides a CTEC model to learn the task from easy to difficult. We divide the CTEC task into three sub-tasks of increasing difficulty: Error Detection, Error Type Identification, and Correction Result Generation. During training, ProTEC guides the model to learn text error correction progressively by incorporating these sub-tasks into a multi-task training objective. During inference, the model completes these sub-tasks in turn to generate the correction results. Extensive experiments and detailed analyses demonstrate the effectiveness and efficiency of the proposed framework.
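The three sub-tasks and their use at training and inference time can be illustrated with a minimal sketch. It is not the ProTEC implementation: the abstract does not specify label formats, loss weighting, or model interfaces, so the names `SubTaskLabels`, `build_labels`, `multitask_loss`, and `progressive_decode`, the weighted-sum form of the objective, the `"none"` type tag, and the restriction to equal-length (substitution-only) sentence pairs are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class SubTaskLabels:
    """Per-character labels for the three sub-tasks (hypothetical format)."""
    detection: List[int]    # 1 where the source character is erroneous, else 0
    error_type: List[str]   # type tag per character; "none" where no error
    correction: List[str]   # the target character at each position


def build_labels(source: str, target: str,
                 classify: Callable[[str, str], str]) -> SubTaskLabels:
    """Derive labels for all three sub-tasks from one (source, target) pair.

    Assumption: source and target have equal length, i.e. only
    substitution errors occur. `classify` is a hypothetical helper that
    names the error type for a (wrong, right) character pair.
    """
    if len(source) != len(target):
        raise ValueError("this sketch assumes equal-length sentence pairs")
    detection = [int(s != t) for s, t in zip(source, target)]
    error_type = [classify(s, t) if d else "none"
                  for s, t, d in zip(source, target, detection)]
    return SubTaskLabels(detection, error_type, list(target))


def multitask_loss(loss_detect: float, loss_type: float, loss_correct: float,
                   weights: Sequence[float] = (1.0, 1.0, 1.0)) -> float:
    """Combine the sub-task losses into one objective.

    Assumption: a weighted sum; the abstract only says the sub-tasks are
    incorporated into a multi-task training objective.
    """
    return (weights[0] * loss_detect
            + weights[1] * loss_type
            + weights[2] * loss_correct)


def progressive_decode(source: str,
                       detect: Callable[[str, int], bool],
                       identify: Callable[[str, int], str],
                       generate: Callable[[str, int, str], str]) -> str:
    """Run the sub-tasks in turn: detect, then identify, then generate.

    Characters not flagged by `detect` are copied unchanged, so the
    generator can only alter positions that passed the easier steps.
    """
    output = []
    for i, ch in enumerate(source):
        if not detect(source, i):
            output.append(ch)  # no error detected: keep the original
            continue
        error_type = identify(source, i)
        output.append(generate(source, i, error_type))
    return "".join(output)
```

With toy stand-ins for the three model heads, for example a detector that flags only the character "平" in "我喜欢吃平果" and a generator that returns "苹", `progressive_decode` rewrites only the flagged position and leaves every other character as it was. Gating generation on detection in this way is one plausible reading of how completing the sub-tasks in order could limit over-correction.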