Pretrained models are often evaluated on multi-task leaderboards to measure their applicability in diverse contexts. However, current methods for aggregating performance across tasks into leaderboard-level rankings do not address the uncertainty and variability at the task level. While recent works have proposed interval-based model rankings, the principled aggregation of uncertainty from individual tasks to leaderboard-level rankings remains unaddressed, and variation in models' performance across tasks is frequently obscured. In this work, we introduce a hierarchical framework that constructs model rank intervals with statistical guarantees at both levels: task-level rank confidence intervals from pairwise comparisons, and leaderboard-level rank prediction intervals using a conformal approach. This enables reliable quantification of model rank for each observed task and for new potential tasks. Experiments on simulated data and the TabArena and PromptEval (MMLU) benchmarks show that our method yields statistically valid and informative intervals, enabling reliable, uncertainty-aware model ranking on leaderboards.
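To illustrate the leaderboard-level step, the following is a minimal sketch of a conformal-style rank prediction interval for a single model. It is not the construction used in this work, which the abstract does not specify. The sketch assumes that tasks are exchangeable and that each observed task contributes one point-estimate rank for the model, so it ignores the task-level confidence intervals that the full framework propagates. The function name `rank_prediction_interval` and its parameters are hypothetical.

```python
import math


def rank_prediction_interval(task_ranks, n_models, alpha=0.1):
    """Two-sided prediction interval for a model's rank on a new task.

    Assumption (not from the paper): the ranks observed on the n tasks and
    the rank on the new task are exchangeable. Under that assumption, order
    statistics of the observed ranks give an interval that covers the new
    rank with probability at least 1 - alpha.
    """
    if not task_ranks:
        raise ValueError("need at least one observed task rank")
    if not 0 < alpha < 1:
        raise ValueError("alpha must lie in (0, 1)")

    n = len(task_ranks)
    ordered = sorted(task_ranks)

    # 1-based indices of the order statistics used as interval endpoints.
    lo_idx = math.floor((n + 1) * alpha / 2)
    hi_idx = math.ceil((n + 1) * (1 - alpha / 2))

    # With too few tasks no finite order statistic is valid, so fall back
    # to the trivial bounds: best possible rank 1, worst possible n_models.
    lower = ordered[lo_idx - 1] if lo_idx >= 1 else 1
    upper = ordered[hi_idx - 1] if hi_idx <= n else n_models
    return lower, upper


# Hypothetical ranks of one model on 25 tasks, on a leaderboard of 10 models.
ranks = [3, 2, 4, 3, 1, 5, 3, 2, 4, 3, 6, 2, 3,
         4, 3, 2, 5, 3, 4, 2, 3, 7, 3, 4, 2]
print(rank_prediction_interval(ranks, n_models=10, alpha=0.2))  # -> (2, 6)
```

With only a handful of tasks the interval degenerates to the full range of ranks, which reflects how little a short leaderboard can say about a new task.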