While generalization over tasks from easy to hard is crucial for profiling large language models (LLMs), datasets with fine-grained difficulty annotations for each problem across a broad range of complexity are still lacking. To address this limitation, we present Easy2Hard-Bench, a consistently formatted collection of 6 benchmark datasets spanning various domains, such as mathematics and programming problems, chess puzzles, and reasoning questions. Each problem within these datasets is annotated with a numerical difficulty score. To estimate problem difficulties systematically, we collect abundant performance data on attempts at each problem, either by humans in the real world or by LLMs on prominent leaderboards. Leveraging this rich performance data, we apply well-established difficulty ranking systems, such as Item Response Theory (IRT) and Glicko-2 models, to uniformly assign numerical difficulty scores to problems. Moreover, the datasets in Easy2Hard-Bench distinguish themselves from previous collections by a higher proportion of challenging problems. Through extensive experiments with six state-of-the-art LLMs, we provide a comprehensive analysis of their performance and generalization capabilities across varying levels of difficulty, with the aim of inspiring future research on LLM generalization. The datasets are available at https://huggingface.co/datasets/furonghuang-lab/Easy2Hard-Bench.
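To illustrate the kind of difficulty estimation the abstract describes, the sketch below fits a one-parameter logistic (Rasch) IRT model to binary solve/fail records by gradient ascent on the log-likelihood. This is a minimal, generic illustration of IRT-style difficulty scoring, not the paper's actual pipeline; the function name, the data layout (solver, problem, correct) triples, and the optimization hyperparameters are all assumptions made for the example.

```python
import math

def fit_rasch(responses, n_iters=500, lr=0.1):
    """Fit a 1PL (Rasch) IRT model by gradient ascent on the log-likelihood.

    responses: list of (solver_idx, problem_idx, correct) triples with
    correct in {0, 1}. Returns per-solver abilities and per-problem
    difficulties; a higher difficulty score means fewer solvers succeed.
    (Illustrative sketch only; names and hyperparameters are assumptions.)
    """
    n_solvers = 1 + max(s for s, _, _ in responses)
    n_problems = 1 + max(p for _, p, _ in responses)
    ability = [0.0] * n_solvers
    difficulty = [0.0] * n_problems
    for _ in range(n_iters):
        grad_a = [0.0] * n_solvers
        grad_d = [0.0] * n_problems
        for s, p, correct in responses:
            # Rasch model: P(correct) = sigmoid(ability - difficulty)
            prob = 1.0 / (1.0 + math.exp(-(ability[s] - difficulty[p])))
            grad_a[s] += correct - prob
            grad_d[p] -= correct - prob
        for s in range(n_solvers):
            ability[s] += lr * grad_a[s]
        for p in range(n_problems):
            difficulty[p] += lr * grad_d[p]
        # Anchor the scale: shifting both vectors by the same constant
        # leaves every predicted probability unchanged, so center
        # difficulties at zero to fix the otherwise-free offset.
        mean_d = sum(difficulty) / n_problems
        difficulty = [d - mean_d for d in difficulty]
        ability = [a - mean_d for a in ability]
    return ability, difficulty
```

For example, if three solvers all succeed on problem 0 but only one of them succeeds on problem 1, the fitted difficulty of problem 1 comes out higher than that of problem 0, and the solver who cracked both receives the highest ability estimate.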