Human evaluation is often required in abstractive summarization to obtain fairer judgments of summary quality. However, it is often time-consuming, costly, inconsistent, and non-reproducible. To overcome these challenges, we explore the potential of using an out-of-the-box LLM (i.e., "gpt-3.5-turbo") for summarization evaluation, without manually selecting demonstrations or performing complex prompt tuning. We compare different evaluation methods, including two methods for Likert-scale scoring and one method for head-to-head comparison, to investigate the performance of the LLM as a zero-shot evaluator. We further propose a meta-correlation metric to measure the stability of the LLM's evaluation capability. Through extensive experiments, we show that certain prompt formats produce better results than others. We also draw attention to the deterioration of the LLM's evaluation capability as the quality of the evaluated summaries rises. In addition, we find that the LLM's evaluation capability depends on the dimension being evaluated. We discuss the pros and cons of each method, make recommendations, and suggest future directions for improvement.