We introduce CheckEval, a novel evaluation framework that uses Large Language Models to address the ambiguity and inconsistency of current evaluation methods. CheckEval divides each evaluation criterion into detailed sub-aspects and constructs a checklist of Boolean questions for each, simplifying the evaluation process. This design not only makes the process more interpretable but also improves the robustness and reliability of the results by focusing each question on a specific evaluation dimension. In a focused case study on the SummEval benchmark, CheckEval shows a strong correlation with human judgments and high Inter-Annotator Agreement. These findings highlight the effectiveness of CheckEval for objective, flexible, and precise evaluation. By offering a customizable and interactive framework, CheckEval sets a new standard for the use of LLMs in evaluation, responding to the evolving needs of the field and establishing a clear methodology for future LLM-based evaluation.
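To make the checklist design concrete, the minimal sketch below scores a summary by posing each Boolean question to an LLM judge and taking the fraction of "yes" answers. The `CHECKLIST` contents, the `ask_llm` callable, the prompt wording, and the yes-fraction aggregation are all illustrative assumptions, not the paper's exact sub-aspects, prompts, or scoring rule.

```python
from typing import Callable, Dict, List

# Hypothetical checklist for one evaluation dimension ("consistency" of a
# summary): Boolean questions grouped by sub-aspect. The actual CheckEval
# sub-aspects and questions may differ.
CHECKLIST: Dict[str, List[str]] = {
    "factual_alignment": [
        "Does every claim in the summary appear in the source document?",
        "Is the summary free of contradictions with the source?",
    ],
    "no_hallucination": [
        "Does the summary avoid introducing entities absent from the source?",
    ],
}

def evaluate(source: str, summary: str,
             ask_llm: Callable[[str], str]) -> float:
    """Aggregate checklist answers into a single score: the fraction of
    questions answered 'yes'. One illustrative aggregation; the paper's
    scoring may differ."""
    answers: List[bool] = []
    for sub_aspect, questions in CHECKLIST.items():
        for question in questions:
            prompt = (
                f"Source:\n{source}\n\nSummary:\n{summary}\n\n"
                f"Sub-aspect: {sub_aspect}\n"
                f"{question} Answer strictly 'yes' or 'no'."
            )
            # Treat any answer starting with 'yes' as a pass.
            answers.append(ask_llm(prompt).strip().lower().startswith("yes"))
    return sum(answers) / len(answers)

# Usage with a stand-in judge that always answers 'yes':
if __name__ == "__main__":
    score = evaluate("source text", "summary text", lambda _: "yes")
    print(f"checklist score: {score:.2f}")  # 1.00
```

Because each question admits only a yes/no answer, the judge's decisions are individually inspectable, which is what makes the aggregate score interpretable.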