We introduce EvaLearn, a pioneering benchmark designed to evaluate large language models (LLMs) on their learning capability and efficiency on challenging tasks, a critical yet underexplored aspect of model potential. EvaLearn contains 648 challenging problems across six task types, grouped into 182 sequences, each dedicated to a single task type. Unlike most existing benchmarks, which evaluate models on problems in parallel, EvaLearn requires models to solve problems sequentially, allowing them to leverage experience gained from earlier solutions. EvaLearn provides five comprehensive automated metrics to evaluate models and quantify their learning capability and efficiency. We extensively benchmark nine frontier models and observe varied performance profiles: some models, such as Claude-3.7-sonnet, start with moderate initial performance but exhibit strong learning ability, while others struggle to benefit from experience and may even show negative transfer. Moreover, we investigate model performance under two learning settings and find that instance-level rubrics and teacher-model feedback further facilitate learning. Importantly, we observe that current LLMs with stronger static abilities do not show a clear advantage in learning capability across all tasks, highlighting that EvaLearn evaluates a new dimension of model performance. We hope EvaLearn provides a novel perspective for assessing LLM potential and understanding the gap between models and human capabilities, promoting the development of deeper and more dynamic evaluation approaches. All datasets, the automated evaluation framework, and the results studied in this paper are available in the GitHub repository.