Evaluating the quality of text generated by large language models (LLMs) remains a significant challenge. Traditional metrics often fail to align well with human judgments, particularly in tasks requiring creativity and nuance. In this paper, we propose Check-Eval, a novel evaluation framework that leverages LLMs to assess the quality of generated text through a checklist-based approach. Check-Eval can be employed as either a reference-free or a reference-dependent evaluation method, providing a structured and interpretable assessment of text quality. The framework consists of two main stages: checklist generation and checklist evaluation. We validate Check-Eval on two benchmark datasets: Portuguese Legal Semantic Textual Similarity and SummEval. Our results demonstrate that Check-Eval achieves higher correlations with human judgments than existing metrics such as G-Eval and GPTScore, underscoring its potential as a more reliable and effective evaluation framework for natural language generation tasks. The code for our experiments is available at https://anonymous.4open.science/r/check-eval-0DB4.
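To make the two-stage pipeline concrete before the method section, the minimal sketch below illustrates one way checklist generation and checklist evaluation could be wired together. The `llm` helper, the prompt wording, and the final score (fraction of satisfied checklist items) are illustrative assumptions, not the exact prompts or aggregation used in the experiments reported in this paper.

```python
from typing import List, Optional


def llm(prompt: str) -> str:
    """Placeholder for any chat-completion call (e.g., an OpenAI or local model client)."""
    raise NotImplementedError


def generate_checklist(source: str, reference: Optional[str] = None) -> List[str]:
    # Stage 1: ask the LLM to extract the key points a high-quality output
    # should cover. In the reference-dependent setting, the human reference
    # is appended to the context; in the reference-free setting it is omitted.
    context = source if reference is None else f"{source}\n\nReference:\n{reference}"
    prompt = (
        "List the key points a high-quality text based on the following content "
        "must contain, one per line:\n\n" + context
    )
    return [line.strip("- ").strip() for line in llm(prompt).splitlines() if line.strip()]


def evaluate_against_checklist(candidate: str, checklist: List[str]) -> float:
    # Stage 2: ask the LLM whether the candidate satisfies each checklist item,
    # then report the fraction of satisfied items as an overall quality score.
    satisfied = 0
    for item in checklist:
        answer = llm(
            "Does the following text satisfy this requirement?\n"
            f"Requirement: {item}\nText: {candidate}\nAnswer yes or no."
        )
        satisfied += answer.strip().lower().startswith("yes")
    return satisfied / max(len(checklist), 1)
```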