Decomposing and Reducing Hidden Measurement Error in LLM Evaluation Pipelines

LLM evaluations drive which models get deployed, which safety standards get adopted, and which research conclusions get published. Yet these scores carry hidden uncertainty: rephrasing the prompt, switching the judge model, or changing the temperature can shift results enough to flip rankings and reverse conclusions. Standard confidence intervals ignore this variance, so they under-cover, and the under-coverage worsens as more data are collected. The unmeasured variance also creates an exploitable surface: model developers can optimize against measurement noise rather than genuine capability. This paper decomposes LLM pipeline uncertainty into its sources, distinguishes variance that shrinks with more data from sensitivity to researcher design choices, and projects the most efficient path to reducing total error. For benchmark builders, the same decomposition identifies which design choices contribute exploitable surface for gaming and prescribes designs that minimize it. Across ideology annotation, safety classification, MMLU benchmarking, and a human-validated propaganda audit, projection-optimized pipelines outperform 73% of possible naive pipelines when scored against a human baseline. On MMLU, optimized budget allocation halves estimation error compared to standard single-prompt evaluation at equivalent cost. A small-sample variance estimation exercise is sufficient to derive confidence intervals that approach nominal coverage when the model includes the relevant pipeline facets, and to generate recommendations for reducing measurement error and improving benchmark robustness.
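The under-coverage claim can be illustrated with a toy simulation. The sketch below is not the paper's estimator. It assumes a single pipeline facet (the prompt), Gaussian effects, and independent item draws per prompt. The function name `simulate_coverage` and the parameters `mu`, `tau`, and `sigma` are hypothetical, chosen only for illustration. The naive interval uses item-level standard error from one prompt. The facet-aware interval treats eight sampled prompts as the units of analysis.

```python
import random
import statistics

# Two-sided 95% Student-t critical value for 7 degrees of freedom (8 prompts).
T_975_DF7 = 2.365
N_PROMPTS = 8


def simulate_coverage(n_items, reps=300, mu=0.7, tau=0.05, sigma=0.3, seed=0):
    """Return (naive_coverage, facet_aware_coverage) of the true score mu.

    Assumed toy model: score = mu + prompt_shift + item_noise, where
    prompt_shift ~ N(0, tau^2) is drawn once per prompt and
    item_noise ~ N(0, sigma^2) is drawn once per item.
    """
    rng = random.Random(seed)
    naive_hits = 0
    facet_hits = 0
    for _ in range(reps):
        prompt_means = []
        for p in range(N_PROMPTS):
            shift = rng.gauss(0, tau)
            scores = [mu + shift + rng.gauss(0, sigma) for _ in range(n_items)]
            prompt_means.append(statistics.fmean(scores))
            if p == 0:
                # Naive interval: one prompt, standard error from items only.
                se = statistics.stdev(scores) / n_items ** 0.5
                naive_hits += int(abs(prompt_means[0] - mu) <= 1.96 * se)
        # Facet-aware interval: prompts are the sampled units, so the
        # between-prompt spread enters the standard error.
        grand_mean = statistics.fmean(prompt_means)
        se_facet = statistics.stdev(prompt_means) / N_PROMPTS ** 0.5
        facet_hits += int(abs(grand_mean - mu) <= T_975_DF7 * se_facet)
    return naive_hits / reps, facet_hits / reps


if __name__ == "__main__":
    for n in (20, 400):
        naive, facet = simulate_coverage(n)
        print(f"n_items={n}: naive coverage={naive:.2f}, facet-aware coverage={facet:.2f}")
```

Under these assumptions, the naive interval narrows as `n_items` grows while the prompt-induced shift stays the same size, so its coverage falls with more data. The facet-aware interval stays close to the nominal 95% because the prompt variance is part of its standard error.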
