Quantifying construct validity in large language model evaluations

The LLM community often reports benchmark results as if they were synonymous with general model capabilities. However, benchmarks can suffer from problems that distort measured performance, such as test set contamination and annotator error. How can we know that a benchmark is a reliable indicator of the capability we want to measure? This question concerns the construct validity of LLM benchmarks, and answering it requires separating benchmark results from capabilities when we model and predict LLM performance. Social scientists and computer scientists have each proposed formal models, latent factor models and scaling laws respectively, for identifying the capabilities underlying benchmark scores. However, neither technique is satisfactory for assessing construct validity. Latent factor models ignore scaling laws, and as a result, the capabilities they extract often act as proxies for model size. Scaling laws ignore measurement error, and as a result, the capabilities they extract are both uninterpretable and overfit to the observed benchmarks. This thesis presents the structured capabilities model, the first model to extract interpretable and generalisable capabilities from a large collection of LLM benchmark results. I fit this model and its two alternatives on a large sample of results from the OpenLLM Leaderboard. The structured capabilities model outperforms latent factor models on parsimonious fit indices and exhibits better out-of-distribution benchmark prediction than scaling laws. These improvements are possible because neither existing approach separates model scale from capabilities in the appropriate way. Model scale should inform capabilities, as in scaling laws, and these capabilities should inform observed results up to measurement error, as in latent factor models. By combining these two insights, the structured capabilities model demonstrates better explanatory and predictive power for quantifying construct validity in LLM evaluations.
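
The two-layer structure described in the abstract (scale informs capabilities, and capabilities inform benchmark scores up to measurement error) can be illustrated with a small simulation. This is only a sketch under strong assumptions: the linear links, the Gaussian noise, and every name here (`simulate_scores`, `scale_slopes`, `loadings`, the noise parameters) are hypothetical choices made for illustration and are not taken from the thesis, whose actual specification may differ.

```python
import numpy as np


def simulate_scores(log_params, scale_slopes, loadings,
                    cap_noise_sd=0.1, meas_noise_sd=0.05, seed=0):
    """Simulate benchmark scores from a two-layer linear model (illustrative only).

    log_params:   (n_models,) log model size for each LLM
    scale_slopes: (n_caps,) assumed effect of log size on each latent capability
    loadings:     (n_benchmarks, n_caps) assumed effect of each capability on each benchmark
    """
    rng = np.random.default_rng(seed)
    log_params = np.asarray(log_params, dtype=float)
    scale_slopes = np.asarray(scale_slopes, dtype=float)
    loadings = np.asarray(loadings, dtype=float)

    n_models, n_caps = log_params.shape[0], scale_slopes.shape[0]
    n_benchmarks = loadings.shape[0]

    # Layer 1 (scaling-law idea): model scale informs latent capabilities.
    capabilities = np.outer(log_params, scale_slopes)
    capabilities = capabilities + rng.normal(0.0, cap_noise_sd, size=(n_models, n_caps))

    # Layer 2 (latent-factor idea): capabilities inform observed scores,
    # up to benchmark-level measurement error.
    scores = capabilities @ loadings.T
    scores = scores + rng.normal(0.0, meas_noise_sd, size=(n_models, n_benchmarks))
    return capabilities, scores


if __name__ == "__main__":
    # Three hypothetical models, two latent capabilities, three benchmarks.
    caps, scores = simulate_scores(
        log_params=[1.0, 2.0, 3.0],
        scale_slopes=[0.5, 1.0],
        loadings=[[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]],
    )
    print(caps.shape, scores.shape)
```

In this sketch, a benchmark that loads on no capability would produce scores that are pure measurement noise, which is one informal way to picture a benchmark with poor construct validity.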
