Separating signal from noise is central to experimental work. Applying well-established statistical methods to LLM evals effectively requires accounting for their particular noise characteristics. We define and measure three types of noise: prediction noise, from generating different answers to a given question; data noise, from sampling questions; and their combined total noise, which follows from the law of total variance. To emphasize relative comparisons and gain statistical power, we propose the all-pairs paired method, which applies paired analysis to every pair of LLMs and measures all noise components from millions of question-level predictions across many evals and settings. The measurements reveal clear patterns. First, each eval exhibits a characteristic and highly predictable total noise level across all model pairs. Second, paired prediction noise typically exceeds paired data noise, so reducing prediction noise by averaging can significantly increase statistical power. Measuring all noise components together lets us assess eval results in context and lowers the barrier to using the best analysis for sound empirical decisions.
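As a rough illustration of the decomposition described above, the following sketch splits the variance of paired score differences between two models into a between-question part (data noise) and a within-question part (prediction noise), and checks that they sum to the total as the law of total variance requires. The function names `paired_noise_components` and `standard_error`, the 0/1 scoring, and the synthetic data are all assumptions made for illustration, not the authors' implementation. The estimates are simple plug-in values, and no bias correction for a small number of samples per question is attempted.

```python
import numpy as np


def paired_noise_components(scores_a, scores_b):
    """Split the variance of paired score differences into two parts.

    scores_a, scores_b: arrays of shape (n_questions, n_samples) holding
    per-prediction scores (e.g. 0/1 correctness) for two models on the
    same questions. Hypothetical helper, for illustration only.
    """
    diff = np.asarray(scores_a, dtype=float) - np.asarray(scores_b, dtype=float)
    # Data noise: variance across questions of the per-question mean difference.
    data_var = diff.mean(axis=1).var()
    # Prediction noise: within-question variance, averaged over questions.
    pred_var = diff.var(axis=1).mean()
    # Total noise: variance over all (question, sample) cells.
    total_var = diff.var()
    return data_var, pred_var, total_var


def standard_error(data_var, pred_var, n_questions, n_samples):
    """Standard error of the mean difference when n_samples predictions
    per question are averaged before comparing (assumes independent
    questions and independent samples within a question)."""
    return np.sqrt((data_var + pred_var / n_samples) / n_questions)


# Synthetic example (assumed data): 500 questions, 8 sampled answers per model.
rng = np.random.default_rng(0)
p_a = rng.uniform(0.2, 0.9, size=500)   # per-question accuracy of model A
p_b = np.clip(p_a - 0.03, 0.0, 1.0)     # model B is slightly worse
scores_a = rng.binomial(1, p_a[:, None], size=(500, 8))
scores_b = rng.binomial(1, p_b[:, None], size=(500, 8))

data_var, pred_var, total_var = paired_noise_components(scores_a, scores_b)
print(f"data={data_var:.4f} prediction={pred_var:.4f} total={total_var:.4f}")
print("SE with 1 sample :", standard_error(data_var, pred_var, 500, 1))
print("SE with 8 samples:", standard_error(data_var, pred_var, 500, 8))
```

With population variances (NumPy's default `ddof=0`) and the same number of samples for every question, the two components add up to the total exactly. The second function shows why averaging helps: only the prediction-noise term shrinks as more samples per question are averaged, so the gain is largest when that term dominates.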