As large language models (LLMs) continue to advance, evaluating their performance accurately and comprehensively becomes increasingly challenging. Human evaluation is conventionally considered the gold standard in natural language generation, but recent work uses state-of-the-art LLMs as proxies for human judges. However, the extent to which humans and LLMs are capable evaluators remains uncertain. This study investigates the behavior of crowd-sourced annotators, expert annotators, and LLMs when comparing outputs from different models. To this end, we curate a dataset of intentionally flawed machine-generated answers. Our findings reveal a concerning bias in the evaluation process: answers with factual errors are rated more favorably than answers that are too short or contain grammatical errors. To address this issue, we propose evaluating machine-generated text independently across multiple dimensions, rather than merging all evaluation aspects into a single score. We instantiate this idea with the Elo rating system, resulting in the Multi-Elo Rating System. Empirical results show that this approach significantly improves the quality of LLM-based evaluations, particularly in terms of factual accuracy. However, we observe no significant improvement in crowd-sourced evaluations, indicating the need for further investigation and refinement.
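The idea of keeping one Elo rating per evaluation dimension, instead of a single merged score, can be illustrated with a minimal sketch. It uses the standard Elo update rule; the dimension names (`accuracy`, `helpfulness`, `language`), the K-factor of 32, the initial rating of 1000, and all function and model names are assumptions made for illustration and are not taken from the abstract.

```python
from collections import defaultdict

K = 32            # assumed update step size (not specified in the abstract)
INITIAL = 1000.0  # assumed starting rating for every model

def expected_score(r_a, r_b):
    """Standard Elo expected score of a player rated r_a against r_b."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def update_multi_elo(ratings, model_a, model_b, outcomes):
    """Update one independent Elo rating per dimension.

    outcomes maps a dimension name to model_a's score on that dimension:
    1.0 for a win, 0.5 for a tie, 0.0 for a loss.
    """
    for dim, score_a in outcomes.items():
        r_a = ratings[dim][model_a]
        r_b = ratings[dim][model_b]
        e_a = expected_score(r_a, r_b)
        ratings[dim][model_a] = r_a + K * (score_a - e_a)
        ratings[dim][model_b] = r_b + K * ((1.0 - score_a) - (1.0 - e_a))

# ratings[dimension][model] -> current Elo rating
ratings = defaultdict(lambda: defaultdict(lambda: INITIAL))

# Hypothetical judgment: model_a reads better but contains a factual error.
update_multi_elo(
    ratings, "model_a", "model_b",
    {"accuracy": 0.0, "helpfulness": 0.5, "language": 1.0},
)
```

Because each dimension is updated separately, a fluent answer with a factual error can gain rating on `language` while losing rating on `accuracy`, rather than having the two effects cancel inside one score.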