Large language models (LLMs) are increasingly used as automatic evaluators of generative AI outputs, a paradigm often referred to as "LLM-as-a-judge." In practice, LLM judges produce imperfect predictions of the underlying truth and can exhibit systematic, non-random errors. Two main approaches have recently been proposed to address this issue: (i) direct measurement-error correction based on misclassification models such as Rogan-Gladen-style estimators, and (ii) surrogate-outcome approaches such as prediction-powered inference (PPI), which correct bias by calibrating prediction residuals on a small set of gold-standard human labels. In this paper, we systematically study the performance of these two approaches for estimating mean parameters (e.g., average benchmark scores or pairwise win rates). Leveraging tools from semiparametric efficiency theory, we unify the two classes of estimators by deriving explicit forms of efficient influence function (EIF)-based efficient estimators, and we characterize conditions under which PPI-style estimators attain strictly smaller asymptotic variance than measurement-error corrections. We verify our theoretical results in simulations and demonstrate the methods on real-data examples. We provide an implementation of the benchmarked methods and comparison utilities at https://github.com/yiqunchen/debias-llm-as-a-judge.
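To make the two corrections concrete, the following is a minimal NumPy sketch of the textbook PPI mean estimator and the Rogan-Gladen prevalence correction for a binary judge label; it is an illustration under simplified assumptions, not the repository's API, and the function names `ppi_mean` and `rogan_gladen_mean` are hypothetical.

```python
# Minimal sketch (hypothetical helper names, not the repository's API).
import numpy as np

def ppi_mean(y_lab, yhat_lab, yhat_unlab):
    """PPI-style mean estimate: judge-score mean on the large unlabeled
    pool, plus a residual correction from the small human-labeled subset."""
    return yhat_unlab.mean() + (y_lab - yhat_lab).mean()

def rogan_gladen_mean(y_lab, yhat_lab, yhat_unlab):
    """Rogan-Gladen-style correction for a binary judge label: rescale the
    raw judge prevalence using sensitivity/specificity estimated on the
    labeled subset (assumes sensitivity + specificity > 1)."""
    sens = yhat_lab[y_lab == 1].mean()        # P(judge = 1 | human = 1)
    spec = 1.0 - yhat_lab[y_lab == 0].mean()  # P(judge = 0 | human = 0)
    return (yhat_unlab.mean() + spec - 1.0) / (sens + spec - 1.0)

# Toy usage with simulated binary human labels and noisy judge labels.
rng = np.random.default_rng(0)
y = rng.binomial(1, 0.6, size=10_000)                 # latent human labels
yhat = np.where(rng.random(10_000) < 0.85, y, 1 - y)  # judge flips 15%
labeled = rng.random(10_000) < 0.05                   # ~5% get human labels
print(ppi_mean(y[labeled], yhat[labeled], yhat[~labeled]))
print(rogan_gladen_mean(y[labeled], yhat[labeled], yhat[~labeled]))
```

Both sketches target the same mean parameter; they differ in how the small labeled subset is used (additive residual calibration versus misclassification-rate rescaling), which is exactly the contrast the asymptotic-variance comparison formalizes.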