We examine multi-task benchmarks in machine learning through the lens of social choice theory. We draw an analogy between benchmarks and electoral systems, where models are candidates and tasks are voters. This suggests a distinction between cardinal and ordinal benchmark systems. The former aggregate numerical scores into a single model ranking; the latter aggregate the per-task rankings. We apply Arrow's impossibility theorem to ordinal benchmarks to highlight the inherent limitations of ordinal systems, particularly their sensitivity to the inclusion of irrelevant models. Inspired by Arrow's theorem, we empirically demonstrate a strong trade-off between diversity and sensitivity to irrelevant changes in existing multi-task benchmarks. Our result is based on new quantitative measures of diversity and sensitivity that we introduce. Sensitivity quantifies the impact that irrelevant changes to tasks have on a benchmark. Diversity captures the degree of disagreement in model rankings across tasks. We develop efficient approximation algorithms for both measures, as computing them exactly is computationally challenging. Through extensive experiments on seven cardinal benchmarks and eleven ordinal benchmarks, we demonstrate a clear trade-off between diversity and stability: the more diverse a multi-task benchmark is, the more sensitive it is to trivial changes. Additionally, we show that the aggregated rankings of existing benchmarks are highly unstable under irrelevant changes. The code and data are available at https://socialfoundations.github.io/benchbench/.
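The distinction between cardinal and ordinal aggregation, and the sensitivity of ordinal systems to irrelevant models, can be illustrated with a small sketch. Everything in it is an assumption made for illustration: the scores are invented toy data, the function names (`cardinal_ranking`, `borda_ranking`, `drop_model`) are hypothetical, and the Borda count stands in as one common ordinal rule rather than as the aggregation used by any particular benchmark or by the benchbench code.

```python
from typing import Dict, List

# Hypothetical layout: task name -> model name -> score (higher is better).
Scores = Dict[str, Dict[str, float]]


def cardinal_ranking(scores: Scores) -> List[str]:
    """Rank models by their mean score across tasks (a cardinal rule)."""
    models = sorted(next(iter(scores.values())))
    mean = {m: sum(row[m] for row in scores.values()) / len(scores) for m in models}
    return sorted(models, key=lambda m: -mean[m])


def borda_ranking(scores: Scores) -> List[str]:
    """Rank models by Borda count over per-task rankings (an ordinal rule).

    On each task the worst model earns 0 points, the next 1, and so on.
    Ties are not treated specially in this sketch; the toy data has none.
    """
    models = sorted(next(iter(scores.values())))
    points = {m: 0 for m in models}
    for row in scores.values():
        worst_to_best = sorted(models, key=lambda m: row[m])
        for pts, m in enumerate(worst_to_best):
            points[m] += pts
    return sorted(models, key=lambda m: -points[m])


def drop_model(scores: Scores, model: str) -> Scores:
    """Return a copy of the benchmark with one model removed."""
    return {t: {m: s for m, s in row.items() if m != model} for t, row in scores.items()}


# Invented toy data: five tasks ("voters"), three models ("candidates").
toy: Scores = {
    "task1": {"A": 0.9, "B": 0.8, "C": 0.1},
    "task2": {"A": 0.9, "B": 0.8, "C": 0.1},
    "task3": {"A": 0.9, "B": 0.8, "C": 0.1},
    "task4": {"A": 0.4, "B": 0.6, "C": 0.5},
    "task5": {"A": 0.4, "B": 0.6, "C": 0.5},
}

with_c = borda_ranking(toy)                      # B ahead of A
without_c = borda_ranking(drop_model(toy, "C"))  # A ahead of B

print("Borda with C:      ", with_c)
print("Borda without C:   ", without_c)
print("Mean with C:       ", cardinal_ranking(toy))
print("Mean without C:    ", cardinal_ranking(drop_model(toy, "C")))
```

In this toy example, model C never ranks first overall, yet removing it flips the relative order of A and B under the Borda count, while the mean-score ranking of A and B is unchanged. This only illustrates sensitivity to the set of models; it does not cover the sensitivity of cardinal benchmarks to irrelevant changes to tasks that the abstract also discusses.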