Modern benchmarks such as HELM MMLU report multiple metrics, including accuracy, robustness, and efficiency. When these metrics are collapsed into a single ranking, natural aggregation procedures can become incoherent or unstable under changes to the model set. We formalize this aggregation as a social choice problem in which each metric induces a preference ranking over models on each dataset, and a benchmark operator aggregates these votes across metrics. While prior work has focused on Arrow's impossibility theorem, we argue that the impossibility often originates from pathological examples and identify sufficient conditions under which it disappears and meaningful multi-criteria benchmarking becomes possible. In particular, we study three restrictions on the admissible combinations of rankings and prove that on single-peaked, group-separable, and distance-restricted preference domains, the benchmark operator admits well-behaved rankings of the participating models. Empirically, we examine several modern benchmark suites, including HELM MMLU, and determine which structural conditions hold on which benchmark problems.
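One of the structural conditions mentioned above, single-peakedness, can be tested directly on a profile of metric-induced rankings. The sketch below is illustrative rather than the paper's implementation (function names and the brute-force search are our own choices): it uses the standard characterization that a ranking is single-peaked with respect to an axis iff each of its top-k sets forms a contiguous interval on that axis, and searches over all candidate axes, which is feasible only for small model sets.

```python
from itertools import permutations

def is_single_peaked(ranking, axis):
    """True iff `ranking` is single-peaked w.r.t. `axis`, using the
    characterization that every prefix (top-k set) of the ranking must
    occupy a contiguous interval of positions on the axis."""
    pos = {model: i for i, model in enumerate(axis)}
    prefix_positions = []
    for model in ranking:
        prefix_positions.append(pos[model])
        lo, hi = min(prefix_positions), max(prefix_positions)
        if hi - lo != len(prefix_positions) - 1:  # prefix not contiguous
            return False
    return True

def profile_single_peaked(profile):
    """Brute-force search for an axis on which every ranking in the
    profile is single-peaked; returns a witnessing axis or None."""
    models = profile[0]
    for axis in permutations(models):
        if all(is_single_peaked(r, axis) for r in profile):
            return axis
    return None
```

For example, the Condorcet-cycle profile `[("a","b","c"), ("b","c","a"), ("c","a","b")]` admits no witnessing axis, whereas dropping the last ranking yields a single-peaked profile; checks of this kind are what an empirical audit of benchmark ranking profiles amounts to.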