One key characteristic of the Chinese spelling check (CSC) task is that incorrect characters are usually similar to the correct ones in either pronunciation or glyph shape. To exploit this, previous work usually relies on confusion sets, which suffer from two problems: it is difficult to determine which character pairs to include, and the sets carry no probabilities to distinguish among their items. In this paper, we propose DISC (decoding intervention with similarity of characters), a lightweight plug-and-play module for CSC models. DISC measures phonetic and glyph similarities between characters and incorporates this similarity information only during the inference phase. The method can be easily integrated into various existing CSC models, such as ReaLiSe, SCOPE, and ReLM, without additional training costs. Experiments on three CSC benchmarks demonstrate that the proposed method significantly improves model performance, approaching and even surpassing the current state-of-the-art models.
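The idea of intervening only at inference time can be illustrated with a minimal sketch. The abstract does not give the exact scoring rule, so everything below is an assumption made for illustration: the additive combination of log-probability and similarity, the weight `alpha`, the function names, and the toy similarity values are all hypothetical and are not taken from the paper.

```python
import math


def rescore_with_similarity(input_char, candidate_probs, similarity, alpha=1.0):
    """Choose the output character for one position of the sentence.

    candidate_probs: dict mapping candidate character -> model probability.
    similarity: function (input_char, candidate) -> value in [0, 1].
    alpha: hypothetical weight on the similarity term (an assumption).
    """
    best_char, best_score = None, -math.inf
    for cand, prob in candidate_probs.items():
        if prob <= 0.0:
            continue  # log(0) is undefined; skip impossible candidates
        # Assumed rule: model log-probability plus weighted similarity.
        score = math.log(prob) + alpha * similarity(input_char, cand)
        if score > best_score:
            best_char, best_score = cand, score
    return best_char


# Made-up similarity values standing in for real phonetic/glyph measures.
TOY_SIM = {("在", "再"): 0.9, ("在", "有"): 0.1}


def toy_similarity(a, b):
    """Symmetric lookup; a character is fully similar to itself."""
    if a == b:
        return 1.0
    return TOY_SIM.get((a, b), TOY_SIM.get((b, a), 0.0))


# Hypothetical model output for a position whose input character is "在".
probs = {"有": 0.5, "再": 0.4, "在": 0.1}

without_intervention = rescore_with_similarity("在", probs, toy_similarity, alpha=0.0)
with_intervention = rescore_with_similarity("在", probs, toy_similarity, alpha=1.0)
```

With `alpha=0.0` the choice reduces to the model's own argmax, and with a positive `alpha` a candidate that closely resembles the input character can overtake a dissimilar one. Because the model's weights are never touched, a step of this kind could in principle wrap any CSC model that exposes per-position candidate probabilities.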