Evaluating large language models (LLMs) typically requires thousands of benchmark items, making the process expensive, slow, and increasingly impractical at scale. Existing evaluation protocols rely on average accuracy over fixed item sets, treating all items as equally informative despite substantial variation in difficulty and discrimination. We introduce ATLAS, an adaptive testing framework based on Item Response Theory (IRT) that estimates model ability using Fisher-information-guided item selection. ATLAS reduces the number of required items by up to 90% while maintaining measurement precision: for instance, it matches whole-bank ability estimates on HellaSwag (5,600 items) using only 41 items (MAE = 0.157). We further reconstruct accuracy from ATLAS's ability estimates and find that the reconstructed accuracies closely match raw accuracies across all five benchmarks, indicating that the ability estimate $\theta$ preserves the global performance structure. At the same time, $\theta$ discriminates more finely among accuracy-equivalent models: of the more than 3,000 models evaluated, 23-31% shift by more than 10 rank positions, and models with identical accuracies receive meaningfully different ability estimates. Code and calibrated item banks are available at https://github.com/Peiyu-Georgia-Li/ATLAS.git.
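To make the selection criterion concrete, below is a minimal, self-contained sketch of Fisher-information-guided adaptive testing under a two-parameter logistic (2PL) IRT model, where $P_i(\theta) = 1/\bigl(1 + e^{-a_i(\theta - b_i)}\bigr)$ and the item information is $I_i(\theta) = a_i^2\, P_i(\theta)\,(1 - P_i(\theta))$. The 2PL form, the EAP ability update, the synthetic item parameters, and all function names here are illustrative assumptions for exposition, not ATLAS's exact implementation; the abstract specifies only that item selection is guided by Fisher information.

```python
import numpy as np

def p_correct(theta, a, b):
    """2PL IRT: probability that a model with ability theta answers an
    item with discrimination a and difficulty b correctly."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

def fisher_information(theta, a, b):
    """Fisher information of a 2PL item at theta: I = a^2 * p * (1 - p)."""
    p = p_correct(theta, a, b)
    return a ** 2 * p * (1.0 - p)

def select_next_item(theta_hat, a, b, administered):
    """Choose the unadministered item with maximal Fisher information
    at the current ability estimate."""
    info = fisher_information(theta_hat, a, b)
    info[administered] = -np.inf  # mask items already shown
    return int(np.argmax(info))

def eap_update(responses, items, a, b, grid=np.linspace(-4.0, 4.0, 161)):
    """Expected-a-posteriori ability estimate on a grid, standard-normal prior."""
    posterior = np.exp(-0.5 * grid ** 2)  # prior, up to a constant
    for j, y in zip(items, responses):
        p = p_correct(grid, a[j], b[j])
        posterior *= p if y else 1.0 - p
    return float(np.sum(grid * posterior) / np.sum(posterior))

# Simulated run: a synthetic 500-item bank and a 41-item budget
# (matching the budget reported for HellaSwag in the abstract).
rng = np.random.default_rng(0)
a = rng.uniform(0.5, 2.5, 500)    # synthetic discrimination parameters
b = rng.normal(0.0, 1.0, 500)     # synthetic difficulty parameters
true_theta = 1.2                  # ability of the simulated model

theta_hat, items, responses = 0.0, [], []
for _ in range(41):
    j = select_next_item(theta_hat, a, b, items)
    y = int(rng.random() < p_correct(true_theta, a[j], b[j]))  # simulated answer
    items.append(j)
    responses.append(y)
    theta_hat = eap_update(responses, items, a, b)

print(f"estimated theta = {theta_hat:.3f} (true {true_theta})")
# One natural way to reconstruct accuracy from theta: the mean predicted
# correctness probability over the full calibrated bank (an assumed mapping).
print(f"reconstructed accuracy = {p_correct(theta_hat, a, b).mean():.3f}")
```

In this sketch, accuracy is reconstructed from $\theta$ as the bank-wide mean of the model's predicted correctness probabilities; whether ATLAS uses exactly this mapping is likewise an assumption here, since the abstract states only that accuracy is reconstructed from the ability estimates.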