Benchmarks have emerged as the central approach for evaluating Large Language Models (LLMs). The research community typically evaluates a model by its average performance across a benchmark's test prompts. This practice implicitly assumes that the test prompts within a benchmark represent a random sample from a real-world distribution of interest. We note that this is generally not the case; instead, we hold that the distribution of interest varies with the specific use case. We find that (1) the correlation in model performance across test prompts is non-random, (2) accounting for correlations across test prompts can change model rankings on major benchmarks, and (3) explanatory factors for these correlations include semantic similarity and common LLM failure points.
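To make claims (1) and (2) concrete, here is a minimal, self-contained Python sketch on synthetic data. It is an illustration under assumed names, not the paper's actual analysis: `scores` is a hypothetical models × prompts correctness matrix, and the inverse-redundancy weighting is one simple way correlations across prompts could be accounted for.

```python
import numpy as np

# A minimal synthetic sketch (not the paper's method): build a hypothetical
# models x prompts correctness matrix, then (1) measure prompt-prompt
# correlation and (2) compare a plain mean-accuracy ranking to one that
# downweights redundant, highly correlated prompts.
rng = np.random.default_rng(0)
n_models, n_prompts = 20, 50

# Each model has a latent ability; abler models succeed more often on every
# prompt, which induces positive correlation between prompt columns.
ability = rng.uniform(0.2, 0.8, size=(n_models, 1))
scores = (rng.uniform(size=(n_models, n_prompts)) < ability).astype(float)

# Drop prompts every model got right (or wrong): zero variance, undefined correlation.
scores = scores[:, scores.std(axis=0) > 0]

# (1) Correlation in model performance across test prompts: entry [i, j]
# correlates prompt i's and prompt j's outcomes across models.
prompt_corr = np.corrcoef(scores.T)
off_diag = prompt_corr[~np.eye(len(prompt_corr), dtype=bool)]
print(f"mean off-diagonal prompt correlation: {off_diag.mean():.3f}")

# (2) Ranking by plain average vs. a correlation-aware weighted average
# that gives less weight to prompts highly correlated with the rest.
plain_rank = np.argsort(-scores.mean(axis=1))
weights = 1.0 / np.abs(prompt_corr).sum(axis=0)
weights /= weights.sum()
weighted_rank = np.argsort(-(scores @ weights))
print("unweighted ranking:       ", plain_rank)
print("correlation-aware ranking:", weighted_rank)
```

Under the latent-ability setup the off-diagonal correlations are systematically positive, and the reweighted ranking can diverge from the plain mean-accuracy ranking; on a real benchmark, the synthetic matrix would be replaced with observed per-prompt results.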