There exists an extremely wide array of LLM benchmarking tasks, yet a single number is often the most actionable for decision-making, especially by non-experts. No aggregation scheme exists that is not Elo-based, and Elo-based schemes can be costly and time-consuming. Here we propose a method to aggregate performance across a general space of benchmarks, nicknamed Project "MPG" (Model Performance and Goodness), a name that also references a metric widely understood to be an important yet crude and inaccurate measure of car performance. We compute two numbers: a "Goodness" number (answer accuracy) and a "Fastness" number (cost or QPS). We compare models against each other and present rankings under our general metric as well as its subdomains. We find strong raw Pearson correlation between our scores and those of Chatbot Arena, exceeding the correlation of the MMLU leaderboard with Chatbot Arena.
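The aggregation described above can be sketched in a few lines. This is a minimal illustration, not the paper's actual pipeline: the benchmark names, accuracies, throughput figures, and leaderboard scores below are all hypothetical, and the aggregation shown (a plain mean over benchmarks) is only one simple choice of aggregator.

```python
from statistics import fmean

def pearson(xs, ys):
    """Raw Pearson correlation coefficient between two equal-length sequences."""
    mx, my = fmean(xs), fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Hypothetical per-benchmark accuracies for one model;
# "Goodness" here is simply their mean.
bench_acc = {"math": 0.62, "code": 0.71, "qa": 0.84}
goodness = fmean(bench_acc.values())

# Hypothetical throughput measurement; "Fastness" as queries per second.
queries, seconds = 500, 40.0
fastness = queries / seconds  # 12.5 QPS

# Hypothetical agreement check: aggregate scores for four models
# against an external leaderboard (e.g. Chatbot Arena ratings).
mpg_scores = [0.72, 0.65, 0.58, 0.44]
arena_scores = [1250, 1180, 1100, 1010]
r = pearson(mpg_scores, arena_scores)
```

A higher `r` indicates closer agreement between the aggregate score and the external leaderboard's ranking.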