Evaluating large language models (LLMs) is costly: it requires generating and examining LLM outputs on a large-scale benchmark of diverse tasks. This paper investigates how to efficiently reduce the set of tasks used to benchmark LLMs without affecting evaluation quality. Our study reveals that task transferability and relevance provide critical information for identifying the most representative subset of tasks by optimizing a facility location function. We propose a practically efficient metric for estimating the transferability between two tasks via in-context learning (ICL). By analyzing the pairwise transferability, we can reduce the tasks in a modern LLM benchmark (e.g., MMLU or FLAN) to 5% while inducing only a <4% difference from the evaluation on the original benchmark. Compared to prior works, our method is training-free, gradient-free, and highly efficient, requiring ICL only.
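To make the selection step concrete, below is a minimal sketch of the standard greedy maximization of a facility location function over a pairwise transferability matrix. The matrix `T`, the function name, and the toy data are illustrative assumptions rather than the paper's released implementation; the greedy loop is the usual (1 - 1/e)-approximate routine for monotone submodular objectives.

```python
import numpy as np

def greedy_facility_location(T: np.ndarray, k: int) -> list[int]:
    """Greedily pick k tasks maximizing the facility location objective
    F(S) = sum_j max_{i in S} T[i, j], where T[i, j] is an estimated
    transferability score from task i to task j (higher = i represents j better).
    This is an illustrative sketch, not the paper's released code."""
    n = T.shape[0]
    selected: list[int] = []
    best = np.zeros(n)  # best[j]: how well task j is covered by the current subset
    for _ in range(k):
        # Marginal gain of each candidate i: total coverage improvement over `best`.
        gains = np.maximum(T, best[None, :]).sum(axis=1) - best.sum()
        gains[selected] = -np.inf  # never re-select an already chosen task
        i = int(np.argmax(gains))
        selected.append(i)
        best = np.maximum(best, T[i])
    return selected

# Toy usage: 20 hypothetical tasks with random transferability scores.
rng = np.random.default_rng(0)
T = rng.random((20, 20))
subset = greedy_facility_location(T, k=1)  # 5% of 20 tasks = 1 representative task
print(subset)
```

In practice, `T` would be populated with the ICL-based transferability estimates described above, and `k` set to the desired fraction of the original benchmark.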