Chinese Spelling Check (CSC) aims to detect and correct erroneous tokens in Chinese text and has a wide range of applications. However, it faces two challenges: annotated data are scarce, and previous methods may not fully exploit the datasets that do exist. In this paper, we introduce RERIC, a plug-and-play retrieval method with error-robust information for Chinese Spelling Check, which can be applied directly to existing CSC models. The retrieval datastore is built entirely from the training data and is designed around the characteristics of CSC. Specifically, we compute the query and key from multimodal representations that fuse phonetic, morphological, and contextual information, to enhance robustness against potential errors. Furthermore, to better judge the retrieved candidates, we treat the n-gram surrounding the token to be checked as the value and use it for reranking. Experimental results on the SIGHAN benchmarks demonstrate that our proposed method achieves substantial improvements over existing work.
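To make the retrieve-then-rerank idea concrete, the following is a minimal sketch and not the RERIC implementation. It assumes that fusion is plain concatenation followed by L2 normalisation, that retrieval is a brute-force cosine search, and that reranking adds a position-wise n-gram overlap bonus. The names `Entry`, `fuse`, `ngram_overlap`, `retrieve_and_rerank`, the weight `alpha`, and the toy vectors are all hypothetical and chosen for illustration only.

```python
import math
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Entry:
    key: List[float]  # fused representation of a training token
    target: str       # gold character at that position
    ngram: str        # surrounding n-gram, stored as the value


def fuse(phonetic, glyph, context) -> List[float]:
    """Concatenate the three views and L2-normalise (an assumed fusion scheme)."""
    vec = list(phonetic) + list(glyph) + list(context)
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def cosine(a: List[float], b: List[float]) -> float:
    # Both vectors are expected to come from fuse(), so they are unit length
    # and the dot product equals the cosine similarity.
    return sum(x * y for x, y in zip(a, b))


def ngram_overlap(a: str, b: str) -> float:
    """Fraction of aligned positions where two equal-length n-grams agree."""
    if not a or len(a) != len(b):
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / len(a)


def retrieve_and_rerank(query, query_ngram, datastore, k=2, alpha=0.5) -> List[Tuple[float, str]]:
    # Step 1: nearest-neighbour search over the fused keys.
    nearest = sorted(datastore, key=lambda e: cosine(query, e.key), reverse=True)[:k]
    # Step 2: rerank the k candidates with an n-gram agreement bonus.
    scored = [
        (cosine(query, e.key) + alpha * ngram_overlap(query_ngram, e.ngram), e.target)
        for e in nearest
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored


# Toy datastore: each 2-d vector stands in for a phonetic, glyph, or context embedding.
datastore = [
    Entry(fuse([1, 0], [1, 0], [1, 0]), "在", "我在家"),
    Entry(fuse([1, 0], [0, 1], [0, 1]), "再", "请再说"),
    Entry(fuse([0, 1], [0, 1], [0, 1]), "的", "我的书"),
]

# Query for the middle token of the misspelled input "我再家".
query = fuse([1, 0], [0.5, 0.5], [0.5, 0.5])
ranked = retrieve_and_rerank(query, "我再家", datastore)
print(ranked[0][1])  # prints 在
```

In this toy case the two homophones 在 and 再 are equally close to the query, so the vector search alone cannot separate them. The stored n-gram "我在家" agrees with the input context in two of three positions, against one for "请再说", which moves 在 to the top.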