Language models have made enormous progress on advanced benchmarks in recent years, but much of this progress has been possible only by using more costly models. Benchmarks may therefore present a distorted picture of progress in practical capabilities *per dollar*. To remedy this, we use data from Artificial Analysis and Epoch AI to assemble the largest dataset to date of current and historical prices for running benchmarks. We find that the price of reaching a given level of benchmark performance has fallen remarkably fast, by around $5\times$ to $10\times$ per year, for frontier models on knowledge, reasoning, math, and software engineering benchmarks. These reductions in the cost of AI inference are due to economic forces, hardware efficiency improvements, and algorithmic efficiency improvements. Isolating open models to control for competition effects and dividing by hardware price declines, we estimate that algorithmic efficiency improves by around $3\times$ per year. At the same time, however, the price of running frontier models is rising by $3\times$ to $18\times$ per year due to larger models and greater reasoning demands. Finally, we recommend that evaluators both publicize and take into account the price of benchmarking as an essential part of measuring the real-world impact of AI.
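The decomposition described in the abstract can be illustrated with a minimal sketch. It assumes that the annual price decline is the product of separate factors and that, for open models, competition effects are negligible, so dividing the fitted price decline by the hardware price decline leaves the algorithmic component. The function name `annual_decline_factor`, the synthetic prices, and the hardware factor of 1.5 are all hypothetical placeholders chosen for illustration; none of them come from the paper's dataset, and the resulting numbers should not be compared with the estimates above.

```python
import math

def annual_decline_factor(years, prices):
    """Fit log(price) = a + b * t by ordinary least squares and return
    exp(-b): the factor by which price falls per year (5.0 means 5x cheaper)."""
    n = len(years)
    logs = [math.log(p) for p in prices]
    mean_t = sum(years) / n
    mean_y = sum(logs) / n
    cov = sum((t - mean_t) * (y - mean_y) for t, y in zip(years, logs))
    var = sum((t - mean_t) ** 2 for t in years)
    slope = cov / var
    return math.exp(-slope)

# Synthetic (hypothetical) prices to reach a fixed benchmark score,
# constructed to fall exactly 6x per year. Not real data.
years = [0.0, 0.5, 1.0, 1.5, 2.0]
prices = [100.0 / 6.0 ** t for t in years]

total = annual_decline_factor(years, prices)

# Placeholder hardware price-decline factor per year (an assumption).
hardware = 1.5

# Under the multiplicative assumption, the remainder is attributed
# to algorithmic efficiency.
algorithmic = total / hardware

print(f"total decline per year:       {total:.2f}x")
print(f"algorithmic decline per year: {algorithmic:.2f}x")
```

Fitting in log space treats the decline as a constant multiplicative rate per year, which matches how the abstract states its results (for example, $5\times$ to $10\times$ per year).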