The evaluation of large language model (LLM) outputs is increasingly performed by other LLMs, a setup commonly known as "LLM-as-a-judge"; such judge models are also called autograders. While autograders offer a scalable alternative to human evaluation, they have shown mixed reliability and may exhibit systematic biases, depending on response type, scoring methodology, domain specificity, or other factors. Here we propose a statistical framework based on Bayesian generalised linear models (GLMs) that enables researchers to assess their autograders while simultaneously addressing their primary research questions (e.g., LLM evaluation). Our approach models evaluation outcomes (e.g., scores or pairwise preferences) as a function of properties of the grader (e.g., human vs. autograder) and of the evaluated item (e.g., response length or the LLM that generated it), allowing for explicit quantification of scoring differences and potential biases within a unified framework. In addition, our method can augment traditional metrics such as inter-rater agreement by providing uncertainty estimates and clarifying sources of disagreement. Overall, this approach contributes to more robust and interpretable use of autograders in LLM evaluation, enabling both performance analysis and bias detection.
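To make the modelling idea concrete, the following is a minimal sketch of the kind of GLM described above, under assumed data and priors: a linear-Gaussian model with a conjugate normal prior, fit in closed form to synthetic scores. All variable names, the synthetic generating process, and the prior/noise variances are illustrative assumptions, not part of the proposed framework.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical evaluation data: 200 graded responses.
n = 200
is_autograder = rng.integers(0, 2, n)   # grader property: human (0) vs autograder (1)
length_z = rng.normal(0, 1, n)          # item property: standardised response length

# Assumed generating process for illustration: autograders score 0.5 points
# higher on average, and longer responses receive slightly higher scores.
scores = 3.0 + 0.5 * is_autograder + 0.2 * length_z + rng.normal(0, 1, n)

# Bayesian linear-Gaussian GLM with a conjugate N(0, tau^2 I) prior on the
# coefficients; posterior mean and covariance are available in closed form.
X = np.column_stack([np.ones(n), is_autograder, length_z])
sigma2, tau2 = 1.0, 10.0                # assumed noise and prior variances
precision = X.T @ X / sigma2 + np.eye(3) / tau2
cov_post = np.linalg.inv(precision)
mean_post = cov_post @ (X.T @ scores) / sigma2
sd_post = np.sqrt(np.diag(cov_post))

for name, m, s in zip(["intercept", "autograder", "length"], mean_post, sd_post):
    print(f"{name:>10}: {m:+.2f} (posterior sd {s:.2f})")
```

The posterior mean and standard deviation of the `autograder` coefficient directly quantify the grader-type effect with uncertainty, which is the role inter-rater agreement alone cannot play; in practice one would use an appropriate link function (e.g., ordinal or Bernoulli for pairwise preferences) and MCMC rather than the conjugate shortcut shown here.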