Evaluations are critical for understanding the capabilities of large language models (LLMs). Fundamentally, evaluations are experiments, yet the literature on evaluations has largely ignored what other sciences have learned about experiment analysis and planning. This article shows researchers with some training in statistics how to think about and analyze data from language model evaluations. Conceptualizing evaluation questions as having been drawn from an unseen super-population, we present formulas for analyzing evaluation data, measuring differences between two models, and planning an evaluation experiment. We make a number of specific recommendations for running language model evaluations and reporting experiment results in a way that minimizes statistical noise and maximizes informativeness.
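To make the super-population framing concrete, the following is a minimal sketch (not taken from the article itself; the function names and toy scores are hypothetical) of the two basic quantities such formulas yield: the standard error of a single model's mean score when the questions are treated as an i.i.d. sample from an unseen population of questions, and the standard error of a paired difference between two models graded on the same questions.

```python
import numpy as np

def mean_and_stderr(scores):
    """Mean score and its CLT-based standard error, treating the n
    questions as an i.i.d. draw from an unseen super-population."""
    s = np.asarray(scores, dtype=float)
    return s.mean(), s.std(ddof=1) / np.sqrt(s.size)

def paired_diff_stderr(scores_a, scores_b):
    """Mean difference between two models scored on the SAME questions,
    with the standard error of the per-question differences (pairing
    removes question-level variance shared by both models)."""
    d = np.asarray(scores_a, dtype=float) - np.asarray(scores_b, dtype=float)
    return d.mean(), d.std(ddof=1) / np.sqrt(d.size)

# Hypothetical 0/1 correctness scores on ten shared questions:
a = [1, 0, 1, 1, 0, 1, 1, 1, 0, 1]
b = [1, 0, 0, 1, 0, 1, 0, 1, 0, 1]

m, se = mean_and_stderr(a)
print(f"model A: {m:.2f} ± {1.96 * se:.2f} (95% CI)")

dm, dse = paired_diff_stderr(a, b)
print(f"A - B:   {dm:.2f} ± {1.96 * dse:.2f} (95% CI)")
```

The paired comparison typically yields a much tighter interval than comparing two independently computed means, since per-question difficulty affects both models alike and cancels in the difference.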