Given the rapid progress of generative AI, there is a pressing need to systematically compare and choose between the numerous models and configurations available. The scale and versatility of such evaluations make the use of LLM-based judges a compelling solution for this challenge. Crucially, this approach first requires validating the quality of the LLM judge itself. Previous work has focused on instance-based assessment of LLM judges, where a judge is evaluated over a set of responses, or response pairs, while remaining agnostic to their source systems. We argue that this setting overlooks critical factors affecting system-level ranking, such as a judge's positive or negative bias towards certain systems. To address this gap, we conduct the first large-scale study of LLM judges as system rankers. System scores are generated by aggregating judgment scores over multiple system outputs, and the judge's quality is assessed by comparing the resulting system ranking to a human-based ranking. Beyond overall judge assessment, our analysis provides a fine-grained characterization of judge behavior, including their decisiveness and bias.
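The aggregate-then-rank evaluation described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the system names, the per-output scores, the choice of mean aggregation, and the use of Kendall's tau as the ranking-agreement measure are all assumptions made here for concreteness.

```python
from itertools import combinations

# Hypothetical per-output judge scores for three systems (illustrative data,
# not from the study).
judge_scores = {
    "system_A": [0.9, 0.8, 0.85],
    "system_B": [0.6, 0.7, 0.65],
    "system_C": [0.75, 0.7, 0.8],
}
# Human-derived system ranking to validate against (1 = best; illustrative).
human_rank = {"system_A": 1, "system_C": 2, "system_B": 3}

# Step 1: aggregate instance-level judgments into one score per system
# (mean aggregation is one simple choice).
system_scores = {s: sum(v) / len(v) for s, v in judge_scores.items()}

# Step 2: rank systems by aggregated judge score (1 = best).
ordered = sorted(system_scores, key=system_scores.get, reverse=True)
judge_rank = {s: i + 1 for i, s in enumerate(ordered)}

def kendall_tau(rank_a: dict, rank_b: dict) -> float:
    """Kendall's tau-a between two rankings over the same items."""
    concordant = discordant = 0
    for x, y in combinations(rank_a, 2):
        sign = (rank_a[x] - rank_a[y]) * (rank_b[x] - rank_b[y])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n = len(rank_a)
    return (concordant - discordant) / (n * (n - 1) / 2)

# Step 3: compare the judge-induced ranking to the human ranking.
tau = kendall_tau(judge_rank, human_rank)  # 1.0 = perfect agreement
```

Any rank-correlation measure (e.g. Spearman's rho) could stand in for Kendall's tau here; the essential point is that the judge is scored on system-level ranking agreement rather than per-instance accuracy.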