Evaluating free-text explanations is a multifaceted, subjective, and labor-intensive task. Large language models (LLMs) present an appealing alternative due to their potential for consistency, scalability, and cost-efficiency. In this work, we present ACORN, a new dataset of 3,500 free-text explanations and aspect-wise quality ratings, and use it to gain insights into how LLMs evaluate explanations. We observed that replacing one of the human ratings with an LLM rating sometimes maintained, but more often lowered, the inter-annotator agreement across different settings and quality aspects, suggesting that LLM judgments are not always consistent with those of human raters. We further quantified this difference by measuring the correlation between LLM-generated ratings and majority-voted human ratings across quality aspects. With the best system, Spearman's rank correlation ranged from 0.53 to 0.95, averaging 0.72 across aspects, indicating moderately high but imperfect alignment. Finally, we considered using an LLM as an additional rater when human raters are scarce: we took majority-voted labels from a limited pool of human raters plus an LLM and measured their correlation with the original gold labels. GPT-4 improved the outcome when there were only two human raters, but in all other observed cases LLMs were neutral to detrimental, including whenever there were three or more human raters. We publicly release the dataset at https://github.com/a-brassard/ACORN to support future improvements in LLM-in-the-loop evaluation.
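The comparison between LLM ratings and majority-voted human ratings can be illustrated with a minimal sketch. It is not the authors' evaluation code. The toy ratings, the tie-breaking rule in `majority_vote`, and all function names are assumptions made for illustration; the abstract does not specify the rating scale or how ties are resolved.

```python
from collections import Counter
from math import sqrt


def majority_vote(ratings):
    """Return the most common rating; ties go to the lowest value (an assumption)."""
    counts = Counter(ratings)
    top = max(counts.values())
    return min(value for value, count in counts.items() if count == top)


def average_ranks(values):
    """Assign 1-based ranks, giving tied values the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's rank correlation: Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sqrt(sum((a - mx) ** 2 for a in rx))
    sy = sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)


# Hypothetical ratings for five explanations on one quality aspect.
human_ratings = [[1, 1, 2], [3, 3, 2], [2, 2, 2], [4, 5, 4], [5, 5, 4]]
llm_ratings = [1, 2, 3, 4, 5]

gold = [majority_vote(r) for r in human_ratings]
rho = spearman(gold, llm_ratings)
print(f"gold labels: {gold}, Spearman rho: {rho:.2f}")
```

In practice, a library routine such as `scipy.stats.spearmanr` would replace the hand-written `spearman` function; the pure-Python version is used here only to keep the sketch free of dependencies.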