Evaluating text summarization is a challenging problem, and existing evaluation metrics are far from satisfactory. In this study, we explored ChatGPT's ability to perform human-like summarization evaluation using four human evaluation methods on five datasets. We found that ChatGPT was able to complete annotations relatively smoothly using Likert scale scoring, pairwise comparison, Pyramid, and binary factuality evaluation. Additionally, it outperformed commonly used automatic evaluation metrics on some datasets. Furthermore, we discussed the impact of different prompts, compared its performance with that of human evaluation, and analyzed the generated explanations and invalid responses.