Instance-Optimal Estimation with Multiple LLM Judges on a Budget

Evaluating large language models increasingly relies on LLM-as-a-judge protocols, but such evaluations remain costly: different judges have different prices and reliabilities, and the difficulty of each prompt-response pair can vary substantially. This raises a basic allocation question: under a fixed budget, how should one distribute evaluation queries across heterogeneous judges and instances to obtain the most accurate score estimates? We formalize this question as *budgeted heteroskedastic multi-judge estimation*. Given $K$ prompt-response pairs, $J$ judges with known costs, and unknown query-judge variances, the goal is to estimate a bounded score vector while minimizing an $\ell_p$-error. Our first contribution is to analyze the inverse-variance weighted estimator (IVWE) and to derive the oracle allocation that minimizes its error rate. Since this allocation depends on the unknown variances, we then address the practical unknown-variance setting by proposing EST-IVWE, an adaptive algorithm that constructs and leverages *optimistically biased* variance estimates to stabilize the empirical allocation. We prove that EST-IVWE matches the oracle IVWE rate up to lower-order terms in the budget. Our second and central theoretical contribution is a matching *local* minimax lower bound, which establishes the instance-optimality of the proposed algorithms. A key technical insight is that Fano-type high-probability arguments are too coarse for this problem: their packing construction loses the local variance structure that governs the optimal allocation. We instead use an Assouad-type in-expectation argument, based on local perturbations, which preserves this structure and yields the sharp allocation-dependent lower bound. Finally, we numerically validate the superiority of our approach over naïve uniform allocation on synthetic and HelpSteer2 datasets.
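To make the inverse-variance weighting concrete, the following is a minimal sketch for a single prompt-response pair, assuming the judge variances are known. The function name `ivwe`, the three judges, and their costs, variances, and budget are all hypothetical values chosen for illustration. The comparison at the end only illustrates why allocation matters; it is not the oracle allocation derived in the paper, and it does not implement EST-IVWE.

```python
import numpy as np

def ivwe(means, counts, variances):
    """Combine per-judge sample means for one item by inverse-variance weighting.

    means[j]     : average score returned by judge j
    counts[j]    : number of queries sent to judge j
    variances[j] : per-query noise variance of judge j (assumed known here)
    Returns the combined estimate and its variance.
    """
    means, counts, variances = (np.asarray(a, dtype=float)
                                for a in (means, counts, variances))
    precision = counts / variances          # precision contributed by each judge
    total_precision = precision.sum()
    estimate = (precision * means).sum() / total_precision
    return estimate, 1.0 / total_precision

# Hypothetical setup: three judges with different costs and noise levels.
costs = np.array([1.0, 4.0, 10.0])
variances = np.array([0.09, 0.02, 0.01])
budget = 120.0

# Baseline: spend the same amount on every judge.
equal_spend_counts = (budget / len(costs)) / costs
_, var_equal = ivwe(np.zeros(3), equal_spend_counts, variances)

# Single-item heuristic: each unit of budget spent on judge j buys precision
# 1 / (cost_j * variance_j), so put the whole budget on the judge maximizing it.
best = int(np.argmin(costs * variances))
focused_counts = np.zeros(3)
focused_counts[best] = budget / costs[best]
_, var_focused = ivwe(np.zeros(3), focused_counts, variances)

print(f"equal spend variance:   {var_equal:.6f}")
print(f"focused spend variance: {var_focused:.6f}")
```

Under these assumed numbers the cheapest judge is not the best value, and neither is the most reliable one; what matters is the product of cost and variance. With many items and an $\ell_p$ objective the trade-off also couples across items, which is the setting the paper analyzes.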
