This paper studies Chinese Spelling Correction (CSC), the task of detecting and correcting spelling errors in a given sentence. Current state-of-the-art methods treat CSC as a sequence tagging task and fine-tune BERT-based models on sentence pairs. However, we identify a critical flaw in tagging one character to another: the correction is excessively conditioned on the error itself. This runs counter to how humans correct text, where a person rephrases the complete sentence based on its semantics rather than relying solely on previously memorized error patterns. Such a counter-intuitive learning process becomes a bottleneck for the generalizability and transferability of machine spelling correction. To address this, we propose Rephrasing Language Modeling (ReLM), in which the model is trained to rephrase the entire sentence by infilling additional slots instead of performing character-to-character tagging. This training paradigm achieves new state-of-the-art results on both fine-tuned and zero-shot CSC benchmarks, outperforming previous methods by a large margin. Our method also learns transferable language representations when CSC is jointly trained with other tasks.
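To make the contrast between character-to-character tagging and rephrasing by slot infilling concrete, the following is a minimal sketch of how a training example might be laid out under each formulation. It is one plausible reading of "infilling additional slots", not the authors' implementation: the `[SEP]` and `[MASK]` token strings, the function names, the equal-length assumption, the use of `None` for unsupervised positions, and the example sentence are all assumptions made for illustration.

```python
from typing import List, Optional, Tuple

# Assumed special tokens, in the style of BERT vocabularies.
SEP = "[SEP]"
MASK = "[MASK]"


def build_tagging_example(source: str, target: str) -> Tuple[List[str], List[str]]:
    """Tagging layout: the label at position i is the corrected character
    for the input character at position i."""
    if len(source) != len(target):
        raise ValueError("this sketch assumes source and target have equal length")
    return list(source), list(target)


def build_rephrasing_example(
    source: str, target: str
) -> Tuple[List[str], List[Optional[str]]]:
    """Infilling layout (hypothetical): the source sentence is followed by a
    separator and one mask slot per target character. Only the slots carry
    labels, so the model has to regenerate the whole sentence."""
    if len(source) != len(target):
        raise ValueError("this sketch assumes source and target have equal length")
    tokens = list(source) + [SEP] + [MASK] * len(target)
    # None marks positions that contribute no loss (source characters and separator).
    labels: List[Optional[str]] = [None] * (len(source) + 1) + list(target)
    return tokens, labels


# Illustrative sentence pair: "平果" is a misspelling of "苹果" (apple).
tag_tokens, tag_labels = build_tagging_example("我喜欢吃平果", "我喜欢吃苹果")
rep_tokens, rep_labels = build_rephrasing_example("我喜欢吃平果", "我喜欢吃苹果")
```

In the tagging layout each output is tied to a single input character, which is where the over-conditioning on the error can arise. In the infilling layout every target character is predicted at a mask slot with the full source sentence as context.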