We introduce EvaLearn, a pioneering benchmark designed to evaluate large language models (LLMs) on their learning capability and efficiency on challenging tasks, a critical yet underexplored aspect of model potential. EvaLearn contains 648 challenging problems across six task types, grouped into 182 sequences, each dedicated to a single task type. Unlike most existing benchmarks, which evaluate models on problems in parallel, EvaLearn requires models to solve problems sequentially, allowing them to leverage experience gained from earlier solutions. EvaLearn provides five comprehensive automated metrics to evaluate models and quantify their learning capability and efficiency. We extensively benchmark nine frontier models and observe varied performance profiles: some models, such as Claude-3.7-sonnet, start with moderate initial performance but exhibit strong learning ability, while others struggle to benefit from experience and may even show negative transfer. Moreover, we investigate model performance under two learning settings and find that instance-level rubrics and teacher-model feedback further facilitate learning. Importantly, we observe that current LLMs with stronger static abilities do not show a clear advantage in learning capability across all tasks, highlighting that EvaLearn evaluates a new dimension of model performance. We hope EvaLearn provides a novel perspective for assessing LLM potential and understanding the gap between models and human capabilities, promoting the development of deeper and more dynamic evaluation approaches. All datasets, the automated evaluation framework, and the results studied in this paper are available in the GitHub repository.