We study when LLM judge panels should be calibrated with low-dimensional stackers versus joint output tables under finite human-label budgets. Low-dimensional stackers have small estimation cost but miss interactions, whereas joint-table calibrators can represent interactions but pay for cell counts and unseen patterns. We cast this tradeoff as a finite-calibration regime map and instantiate it as Finite-Calibration Panel Selection, a deployable validation selector over judge path, prefix size, and aggregator family with table and parametric estimation diagnostics. On RewardBench, LLMBar, SummEval, and Arena100K with a seven-judge pool including DeepSeek V4 Flash, scalar/reliability aggregation wins 16 of 20 real dataset--budget cells, indicating that current judge outputs are often additive or redundant. Controlled calibration-growth data show the complementary regime: additive labels remain scalar-favored, whereas a six-way interaction selects a larger joint table and its test MSE drops from 0.224 to 0.061 once unseen mass vanishes. Thus the practical question is not ``how many judges?'' but whether the next judge's information is estimable under the available human labels.
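The cost side of the joint-table tradeoff can be made concrete with a small sketch. It is not the paper's implementation: it assumes each judge emits a discrete output (binary by default), and the helper names `num_cells`, `fit_joint_table`, `unseen_mass`, and `predict` are hypothetical. It shows that the number of table cells grows exponentially with panel size, and that any test pattern absent from the calibration set must fall back to a cruder estimate (here, the global label mean).

```python
from collections import defaultdict


def num_cells(num_judges, levels=2):
    """Number of cells in a joint output table (assumes `levels` outputs per judge)."""
    return levels ** num_judges


def fit_joint_table(patterns, labels):
    """Map each observed judge-output pattern to the mean human label in that cell."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for pattern, label in zip(patterns, labels):
        sums[pattern] += label
        counts[pattern] += 1
    return {pattern: sums[pattern] / counts[pattern] for pattern in counts}


def unseen_mass(table, test_patterns):
    """Fraction of test items whose judge-output pattern never appeared in calibration."""
    unseen = sum(1 for pattern in test_patterns if pattern not in table)
    return unseen / len(test_patterns)


def predict(table, pattern, fallback):
    """Table lookup, with a fallback value for unseen patterns (an assumption of this sketch)."""
    return table.get(pattern, fallback)


# Toy calibration set: two judges, three human-labeled items.
calib_patterns = [(0, 1), (0, 1), (1, 1)]
calib_labels = [1.0, 0.0, 1.0]
table = fit_joint_table(calib_patterns, calib_labels)
global_mean = sum(calib_labels) / len(calib_labels)

test_patterns = [(0, 1), (0, 0), (1, 1), (1, 0)]
print(num_cells(6))                       # cells for a six-judge binary panel
print(unseen_mass(table, test_patterns))  # share of test patterns with no calibration data
print(predict(table, (0, 0), global_mean))
```

Under these assumptions, a joint table over six binary judges has 64 cells to fill from the human-label budget, whereas an additive stacker over the same judges would need only on the order of one weight per judge.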