Chinese Text Error Correction (CTEC) aims to detect and correct errors in input text, which benefits both everyday use and various downstream tasks. Recent approaches mainly employ Pre-trained Language Models (PLMs) to address the CTEC task and have achieved considerable success. However, these approaches suffer from both over-correction and under-correction, and the former is especially conspicuous in the precision-critical CTEC task. To mitigate over-correction, we propose ProTEC, a novel model-agnostic progressive multi-task learning framework for CTEC that guides a CTEC model to learn the task from easy to difficult. We divide the CTEC task into three sub-tasks of increasing difficulty: Error Detection, Error Type Identification, and Correction Result Generation. During training, ProTEC guides the model to learn text error correction progressively by incorporating these sub-tasks into a multi-task training objective. During inference, the model completes the sub-tasks in turn to generate the correction results. Extensive experiments and detailed analyses demonstrate the effectiveness and efficiency of the proposed framework.
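To make the three-stage decomposition concrete, the following is a minimal Python sketch of how a multi-task objective and a staged inference pass could be wired together. It is an illustration under stated assumptions, not the ProTEC implementation: the function names, the weighted-sum form of the loss, the per-token interfaces, and the rule that only flagged tokens may be rewritten are all hypothetical, and the toy sub-task functions stand in for a trained model.

```python
from typing import Callable, List, Sequence

# Hypothetical sub-task interfaces; none of these names come from the paper.
Detector = Callable[[Sequence[str]], List[bool]]      # per token: is it erroneous?
TypeIdentifier = Callable[[Sequence[str], int], str]  # error type at a flagged position
Generator = Callable[[Sequence[str], int, str], str]  # replacement for a flagged position


def multitask_loss(detection: float, type_identification: float, generation: float,
                   weights: Sequence[float] = (1.0, 1.0, 1.0)) -> float:
    """Combine the three sub-task losses as a weighted sum (assumed form)."""
    w_det, w_type, w_gen = weights
    return w_det * detection + w_type * type_identification + w_gen * generation


def correct(tokens: Sequence[str], detect: Detector,
            identify: TypeIdentifier, generate: Generator) -> List[str]:
    """Run the sub-tasks in turn: detection, then type identification, then generation."""
    flags = detect(tokens)
    output = list(tokens)
    for i, flagged in enumerate(flags):
        if not flagged:
            continue  # assumption: unflagged tokens are copied through unchanged
        error_type = identify(tokens, i)
        output[i] = generate(tokens, i, error_type)
    return output


# Toy stand-ins for a trained model, for illustration only.
def toy_detect(tokens: Sequence[str]) -> List[bool]:
    # Flag "心" only when it directly follows "高".
    return [i > 0 and tokens[i - 1] == "高" and tok == "心"
            for i, tok in enumerate(tokens)]


def toy_identify(tokens: Sequence[str], i: int) -> str:
    return "phonetic"  # every flagged token is treated as a phonetic confusion


def toy_generate(tokens: Sequence[str], i: int, error_type: str) -> str:
    confusions = {("心", "phonetic"): "兴"}
    # Fall back to the original token when no replacement is known.
    return confusions.get((tokens[i], error_type), tokens[i])


if __name__ == "__main__":
    print("".join(correct(list("我今天很高心"), toy_detect, toy_identify, toy_generate)))
    print(multitask_loss(0.5, 0.25, 1.0))
```

In this sketch, restricting generation to positions flagged by detection is one way a staged pipeline could limit over-correction, since tokens the detector leaves unflagged cannot be altered downstream. Whether ProTEC constrains generation in exactly this manner is not stated in the abstract.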