Iterative evaluation of LLMs during training is essential to ensure that capabilities develop as expected, but it can be time- and compute-intensive. While NLU tasks, where the model selects from fixed answer choices, are cheap to evaluate, key capabilities such as reasoning and code generation rely on the more time-consuming NLG (token-by-token generation) format. In this work, we aim to reduce the computational burden of NLG benchmarks so that crucial LLM capabilities can be monitored during model training. We reformulate generative tasks into computationally cheaper NLU alternatives. We test the performance correlation between the original and reformulated tasks using 8 LMs of various sizes across 4 capabilities: mathematical reasoning, code generation, factual knowledge, and reading comprehension. Our results show a strong correlation between task formats, supporting capability assessment via the cheaper alternatives and yielding an over 35x average reduction in evaluation time. Our project is available at: https://github.com/Fraunhofer-IIS/EvalShortcut
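To make the cost difference concrete, the following is a minimal, hypothetical sketch (not the EvalShortcut implementation) of the two evaluation styles using Hugging Face transformers: an NLU-style reformulation scores each fixed answer choice by its log-likelihood under the model with one forward pass per option, whereas NLG-style evaluation generates the answer token by token. The model name ("gpt2"), the example question, and the answer options are placeholders chosen for illustration only.

```python
# Sketch contrasting NLU-style multiple-choice scoring with NLG-style generation.
# Not the paper's code; model, question, and options are illustrative placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; any causal LM works
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

question = "Q: What is 12 * 7?\nA:"
options = [" 84", " 74", " 96"]  # fixed answer choices (NLU reformulation)

def option_logprob(prompt: str, option: str) -> float:
    """Sum of log-probabilities the model assigns to the option tokens given the prompt."""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    option_ids = tokenizer(option, return_tensors="pt").input_ids
    input_ids = torch.cat([prompt_ids, option_ids], dim=1)
    with torch.no_grad():
        logits = model(input_ids).logits
    # Log-probs for each option token, taken from the position that predicts it.
    log_probs = torch.log_softmax(logits[0, prompt_ids.shape[1] - 1 : -1], dim=-1)
    token_log_probs = log_probs.gather(1, option_ids[0].unsqueeze(1)).squeeze(1)
    return token_log_probs.sum().item()

# NLU-style evaluation: one forward pass per option, pick the highest-scoring choice.
scores = {opt: option_logprob(question, opt) for opt in options}
print("NLU (multiple-choice) pick:", max(scores, key=scores.get))

# NLG-style evaluation: autoregressive generation, one forward pass per generated token.
gen_ids = model.generate(
    tokenizer(question, return_tensors="pt").input_ids,
    max_new_tokens=20,
    do_sample=False,
    pad_token_id=tokenizer.eos_token_id,
)
print("NLG (generated) answer:", tokenizer.decode(gen_ids[0], skip_special_tokens=True))
```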