Large Language Models (LLMs) have recently been shown to be effective automatic evaluators when used with simple prompting and in-context learning. In this work, we assemble 15 LLMs spanning four size ranges and evaluate their output responses through preference rankings produced by the other LLMs acting as evaluators (e.g., "System Star is better than System Square"). We then assess the quality of these rankings by introducing the Cognitive Bias Benchmark for LLMs as Evaluators (CoBBLEr), a benchmark that measures six cognitive biases in LLM evaluation outputs, such as the Egocentric bias, in which a model prefers to rank its own outputs highly. We find that LLMs are biased text quality evaluators: in each of their evaluations they exhibit strong indications of bias on our benchmark (on average, 40% of comparisons across all models), which calls their robustness as evaluators into question. Furthermore, we examine the correlation between human and machine preferences and find an average Rank-Biased Overlap (RBO) score of 49.6%, indicating that machine preferences are misaligned with human preferences. According to our findings, LLMs may not yet be suitable for automatic annotation aligned with human preferences. Our project page is at: https://minnesotanlp.github.io/cobbler.
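For readers unfamiliar with Rank-Biased Overlap, the following is a minimal sketch of how a truncated RBO score between a human ranking and a machine ranking could be computed. It is not taken from the CoBBLEr codebase: the function name `rbo_truncated`, the persistence parameter `p = 0.9`, the optional normalization, and the example system names are all assumptions made for illustration, and the paper's exact RBO variant and settings may differ.

```python
def rbo_truncated(ranking_a, ranking_b, p=0.9, normalize=True):
    """Truncated Rank-Biased Overlap between two ranked lists.

    Computes (1 - p) * sum_{d=1..k} p^(d-1) * A_d, where A_d is the size of
    the overlap of the two top-d prefixes divided by d, and k is the length
    of the shorter list. With normalize=True (an illustrative choice, not
    necessarily the paper's), the sum is divided by (1 - p^k) so that two
    identical rankings score exactly 1.0.
    """
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")

    depth = min(len(ranking_a), len(ranking_b))
    if depth == 0:
        return 0.0

    seen_a, seen_b = set(), set()
    weighted_sum = 0.0
    for d in range(1, depth + 1):
        # Grow the top-d prefixes one item at a time.
        seen_a.add(ranking_a[d - 1])
        seen_b.add(ranking_b[d - 1])
        agreement = len(seen_a & seen_b) / d
        weighted_sum += (p ** (d - 1)) * agreement

    score = (1 - p) * weighted_sum
    if normalize:
        score /= 1 - p ** depth
    return score


# Hypothetical rankings of three systems, best first.
human_ranking = ["system_a", "system_b", "system_c"]
machine_ranking = ["system_b", "system_a", "system_c"]
print(rbo_truncated(human_ranking, machine_ranking))
```

Because the weights decay geometrically with depth, disagreements near the top of the rankings lower the score more than disagreements near the bottom, which is why RBO suits comparisons where the top-ranked systems matter most.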