ChatGPT has demonstrated impressive performance in various downstream tasks. However, in the Chinese Spelling Correction (CSC) task, we observe a discrepancy: while ChatGPT performs well under human evaluation, it scores poorly according to traditional metrics. We believe this inconsistency arises because the traditional metrics are not well-suited for evaluating generative models. Their overly strict length and phonetic constraints may lead to underestimating ChatGPT's correction capabilities. To better evaluate generative models in the CSC task, this paper proposes a new evaluation metric: Eval-GCSC. By incorporating word-level and semantic similarity judgments, it relaxes the stringent length and phonetic constraints. Experimental results show that Eval-GCSC closely aligns with human evaluations. Under this metric, ChatGPT's performance is comparable to that of traditional token-level classification models (TCM), demonstrating its potential as a CSC tool. The source code and scripts can be accessed at https://github.com/ktlKTL/Eval-GCSC.