Using large language models (LLMs) to evaluate text quality has recently gained popularity. Several prior works explore LLM-based evaluation, but they differ in the details of the evaluation process. In this paper, we analyze LLM evaluation (Chiang and Lee, 2023) and G-Eval (Liu et al., 2023), and we discuss how those details affect how well the ratings given by LLMs correlate with human ratings. We find that the auto Chain-of-Thought (CoT) used in G-Eval does not always make G-Eval more aligned with human ratings. We also show that forcing the LLM to output only a numeric rating, as in G-Eval, is suboptimal. Finally, we reveal that asking the LLM to explain its own ratings consistently improves the correlation between ChatGPT and human ratings and pushes state-of-the-art (SoTA) correlations on two meta-evaluation datasets.
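The abstract measures evaluation quality by how well LLM ratings correlate with human ratings. As a minimal sketch of that kind of meta-evaluation, the snippet below computes Pearson's r between two rating lists. The abstract does not say which correlation coefficient is used, so Pearson's r is an assumption here, and the function name `pearson_r` and the example ratings are hypothetical, chosen only for illustration.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length rating lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of co-deviations and sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant ratings")
    return cov / sqrt(var_x * var_y)


# Hypothetical ratings for five texts on a 1-5 scale (not data from the paper).
human_ratings = [1, 2, 3, 4, 5]
llm_ratings = [2, 2, 3, 5, 5]

print(f"Pearson r = {pearson_r(human_ratings, llm_ratings):.3f}")
```

A higher coefficient means the LLM's ratings track the human ratings more closely. Rank-based coefficients such as Spearman's or Kendall's could be swapped in under the same setup.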