Existing language model benchmarks provide contradictory model rankings, even for benchmarks that aim to capture similar skills. This problem of conflicting rankings hampers model selection, clouds model comparisons, and adds confusion to a growing ecosystem of competing models. In this paper, we take a different perspective on model comparison: instead of relying on out-of-the-box performance via direct evaluation, we compare model potential by providing each model with identical benchmark-specific fine-tuning before evaluation. We call this approach train-before-test. Our primary contribution is a comprehensive empirical evaluation of model potential across 24 benchmarks and 61 models. First, we demonstrate that model potential rankings obtained through train-before-test exhibit remarkable consistency across all benchmarks. Whereas rankings obtained under direct evaluation show little external validity, rankings obtained under train-before-test enjoy a significant degree of external validity: model potential rankings transfer gracefully from one benchmark to another. Second, train-before-test restores the connection between perplexity and downstream task performance, lost under direct evaluation. Remarkably, even the pre-finetuning perplexity of a base model predicts post-finetuning downstream performance, suggesting that ranking consistency reflects inherent model potential rather than fine-tuning artifacts. Finally, train-before-test reduces the model-score matrix to essentially rank one, indicating that model potential is dominated by a single latent factor, uncovered by train-before-test. Our work supports the recommendation to make train-before-test a default component of LLM benchmarking.
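As a rough illustration of the protocol and the two quantities the abstract refers to (cross-benchmark rank agreement and the near rank-one structure of the model-score matrix), the sketch below is a minimal rendering, not the paper's implementation. The `fine_tune` and `evaluate` helpers are hypothetical placeholders for the benchmark-specific fine-tuning and evaluation pipeline; only the numpy/scipy post-processing uses standard library calls.

```python
# Minimal sketch of train-before-test, assuming hypothetical fine_tune() and
# evaluate() callables for benchmark-specific fine-tuning and evaluation.
import numpy as np
from scipy.stats import spearmanr


def train_before_test(models, benchmarks, fine_tune, evaluate):
    """Fill a (model x benchmark) score matrix: every model receives the same
    benchmark-specific fine-tuning before it is evaluated on that benchmark."""
    scores = np.zeros((len(models), len(benchmarks)))
    for i, model in enumerate(models):
        for j, bench in enumerate(benchmarks):
            tuned = fine_tune(model, bench.train_split)   # identical recipe for every model
            scores[i, j] = evaluate(tuned, bench.test_split)
    return scores


def rank_agreement(scores):
    """Average pairwise Spearman correlation between the model rankings
    induced by each benchmark column."""
    n = scores.shape[1]
    rhos = [spearmanr(scores[:, a], scores[:, b])[0]
            for a in range(n) for b in range(a + 1, n)]
    return float(np.mean(rhos))


def top_singular_value_share(scores):
    """Fraction of spectral energy carried by the leading singular value;
    values close to 1 mean the score matrix is essentially rank one."""
    s = np.linalg.svd(scores, compute_uv=False)
    return float(s[0] ** 2 / np.sum(s ** 2))
```

Under the abstract's claims, both `rank_agreement` and `top_singular_value_share` would be markedly higher for the matrix produced by train-before-test than for one produced by direct evaluation of the untuned models.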