Evaluating the performance of Grammatical Error Correction (GEC) models has become increasingly challenging, as large language model (LLM)-based GEC systems often produce corrections that diverge from the provided gold references. This discrepancy undermines the reliability of traditional reference-based evaluation metrics. In this study, we propose DSGram, a novel evaluation framework for GEC models that integrates Semantic Coherence, Edit Level, and Fluency under a dynamic weighting mechanism. Our framework employs the Analytic Hierarchy Process (AHP) in conjunction with large language models to determine the relative importance of each evaluation criterion. Additionally, we construct a dataset of human annotations and LLM-simulated sentences to validate our algorithms and to fine-tune more cost-effective models. Experimental results indicate that our proposed approach enhances the effectiveness of GEC model evaluation.
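To make the weighting step concrete, the sketch below shows one way an AHP-based combination of the three sub-metrics could be computed: a pairwise comparison matrix over Semantic Coherence, Edit Level, and Fluency is reduced to priority weights via its principal eigenvector, a consistency ratio sanity-checks the judgments, and the weights aggregate the sub-scores into one score. This is a minimal sketch under stated assumptions, not the paper's implementation; the comparison values and sub-scores are hypothetical placeholders, whereas in DSGram such judgments would be supplied by an LLM per input, which is what makes the weights dynamic.

```python
# Minimal AHP weighting sketch for three GEC sub-metrics.
# All numeric values below are illustrative placeholders, not DSGram outputs.
import numpy as np

CRITERIA = ["semantic_coherence", "edit_level", "fluency"]

# Saaty-scale pairwise comparison matrix A: A[i][j] states how much more
# important criterion i is than criterion j, with A[j][i] = 1 / A[i][j].
A = np.array([
    [1.0, 3.0, 2.0],
    [1/3, 1.0, 1/2],
    [1/2, 2.0, 1.0],
])

def ahp_weights(matrix: np.ndarray) -> np.ndarray:
    """Priority weights = normalized principal eigenvector of the matrix."""
    eigvals, eigvecs = np.linalg.eig(matrix)
    principal = eigvecs[:, np.argmax(eigvals.real)].real
    return principal / principal.sum()  # sum-normalize (sign cancels)

def consistency_ratio(matrix: np.ndarray) -> float:
    """Saaty's consistency ratio; values below ~0.1 are usually accepted."""
    n = matrix.shape[0]
    lambda_max = np.max(np.linalg.eigvals(matrix).real)
    ci = (lambda_max - n) / (n - 1)          # consistency index
    ri = {3: 0.58, 4: 0.90, 5: 1.12}[n]      # Saaty's random-index table
    return ci / ri

weights = ahp_weights(A)
# Hypothetical per-sentence sub-scores (e.g., as an LLM judge might assign).
sub_scores = {"semantic_coherence": 8.5, "edit_level": 7.0, "fluency": 9.0}

overall = sum(w * sub_scores[c] for w, c in zip(weights, CRITERIA))
print(dict(zip(CRITERIA, weights.round(3))))
print(f"CR = {consistency_ratio(A):.3f}, overall score = {overall:.2f}")
```

Deriving weights from the eigenvector rather than fixing them by hand is what allows the relative importance of the criteria to shift from sentence to sentence as the pairwise judgments change.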