As large language models (LLMs) continue to advance, evaluating their performance accurately and comprehensively becomes increasingly challenging. Human evaluation has conventionally been considered the gold standard in natural language generation, and recent work uses state-of-the-art LLMs as proxies for human judges. Nonetheless, how capable humans and LLMs actually are as evaluators remains uncertain. This study investigates the behavior of both crowd-sourced human judges and LLM-based judges when comparing outputs from different models. To this end, we curate a dataset of intentionally flawed machine-generated answers. We find that, despite the potentially greater danger posed by factual errors, answers containing factual errors were still rated more favorably than answers that were too short or contained grammatical errors. This highlights a concerning bias in the evaluation process. To address this issue, we propose evaluating machine-generated text independently along multiple dimensions, rather than merging all evaluation aspects into a single score. We instantiate this idea with the Elo rating system, resulting in the Multi-Elo Rating System. Empirical results show that the proposed approach significantly improves the quality of LLM-based evaluations, particularly in terms of factual accuracy. However, no notable improvement is observed in crowd-sourced evaluations, suggesting the need for further investigation and refinement.
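The idea of keeping one Elo rating per evaluation dimension can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the dimension names, the starting rating of 1000, the K-factor of 32, and the names `MultiElo`, `record`, and `elo_update` are all hypothetical choices made for the example. Only the standard Elo expected-score and update formulas are taken as given.

```python
from collections import defaultdict

# Assumed constants: the abstract does not state these values.
INITIAL_RATING = 1000.0
K_FACTOR = 32.0


def expected_score(rating_a: float, rating_b: float) -> float:
    """Standard Elo expected score of A against B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def elo_update(rating_a: float, rating_b: float, score_a: float,
               k: float = K_FACTOR) -> tuple[float, float]:
    """Return updated (A, B) ratings; score_a is 1.0 win, 0.5 tie, 0.0 loss."""
    exp_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - exp_a)
    # B's actual and expected scores are the complements of A's.
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - exp_a))
    return new_a, new_b


class MultiElo:
    """Keeps a separate Elo table for each evaluation dimension."""

    def __init__(self, dimensions):
        self.ratings = {
            dim: defaultdict(lambda: INITIAL_RATING) for dim in dimensions
        }

    def record(self, model_a: str, model_b: str, verdicts: dict) -> None:
        """Apply one pairwise comparison; verdicts maps dimension -> score for A."""
        for dim, score_a in verdicts.items():
            table = self.ratings[dim]
            table[model_a], table[model_b] = elo_update(
                table[model_a], table[model_b], score_a
            )


# Hypothetical dimensions and model names, for illustration only.
system = MultiElo(["accuracy", "helpfulness", "language"])
system.record(
    "model_x", "model_y",
    {"accuracy": 0.0, "helpfulness": 1.0, "language": 0.5},
)
for dim, table in system.ratings.items():
    print(dim, dict(table))
```

Because each dimension is updated independently, an answer that reads fluently but contains a factual error can gain rating on a language dimension while losing rating on an accuracy dimension, instead of having both effects folded into one score.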