The increasing versatility of language models (LMs) has given rise to a new class of benchmarks that comprehensively assess a broad range of capabilities. Such benchmarks are associated with massive computational costs, reaching thousands of GPU hours per model. However, the efficiency of these evaluation efforts has received little discussion in the literature. In this work, we present the problem of Efficient Benchmarking, namely, intelligently reducing the computation costs of LM evaluation without compromising reliability. Using the HELM benchmark as a test case, we investigate how different benchmark design choices affect the computation-reliability trade-off. We propose to evaluate the reliability of such decisions using a new measure, Decision Impact on Reliability (DIoR for short). We find, for example, that the current leader on HELM may change merely by removing a low-ranked model from the benchmark, and observe that a handful of examples suffice to obtain the correct benchmark ranking. Conversely, a slightly different choice of HELM scenarios changes the ranking widely. Based on our findings, we outline a set of concrete recommendations for more efficient benchmark design and utilization practices, leading to dramatic cost savings with minimal loss of benchmark reliability, often reducing computation by 100x or more.
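The claim that removing a low-ranked model can change the leader is easiest to see with a relative aggregate. The sketch below is illustrative only: it assumes the ranking is computed by mean win rate (the fraction of other models a model beats, averaged over scenarios), which the abstract does not name, and the models `A`, `B`, `C` and their scores are invented toy data, not HELM results.

```python
from typing import Dict, List

Scores = Dict[str, List[float]]  # model name -> one score per scenario


def mean_win_rate(scores: Scores) -> Dict[str, float]:
    """Average, over scenarios, of the fraction of other models beaten.

    Assumed aggregation for illustration; not taken from the abstract.
    """
    models = list(scores)
    n_scenarios = len(next(iter(scores.values())))
    result = {}
    for m in models:
        others = [o for o in models if o != m]
        rates = []
        for s in range(n_scenarios):
            wins = sum(scores[m][s] > scores[o][s] for o in others)
            rates.append(wins / len(others))
        result[m] = sum(rates) / n_scenarios
    return result


def leader(scores: Scores) -> str:
    """Name of the model with the highest mean win rate."""
    mwr = mean_win_rate(scores)
    return max(mwr, key=mwr.get)


# Hypothetical toy scores over five scenarios.
# B beats A head-to-head in three scenarios, A beats B in two.
# C is weak overall but sits between A and B in the two scenarios A wins.
full: Scores = {
    "A": [0.6, 0.6, 0.9, 0.9, 0.6],
    "B": [0.8, 0.8, 0.3, 0.3, 0.8],
    "C": [0.2, 0.2, 0.5, 0.5, 0.2],
}
without_c: Scores = {m: v for m, v in full.items() if m != "C"}

print("with C:   ", mean_win_rate(full), "leader:", leader(full))
print("without C:", mean_win_rate(without_c), "leader:", leader(without_c))
```

With all three models, `A` leads because it collects partial wins over `C` in every scenario; once the lowest-ranked `C` is dropped, only the head-to-head comparison remains and `B` takes the lead. Any aggregate that scores a model relative to the others in the pool can behave this way.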