Iterative evaluation of LLMs during training is essential to ensure that capabilities develop as expected, but it can be time- and compute-intensive. While NLU tasks, where the model selects from fixed answer choices, are cheap to evaluate, key capabilities such as reasoning and code generation rely on the more time-consuming NLG (token-by-token generation) format. In this work, we aim to reduce the computational burden of NLG benchmarks so that crucial LLM capabilities can be monitored during model training. We reformulate generative tasks into computationally cheaper NLU alternatives. We test the performance correlation between the original and reformulated tasks using 8 LMs of various sizes across 4 capabilities: mathematical reasoning, code generation, factual knowledge, and reading comprehension. Our results show a strong correlation between task formats, supporting capability assessment via the cheaper alternatives and yielding an over 35x average reduction in evaluation time. Our project is available at: https://github.com/Fraunhofer-IIS/EvalShortcut
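To make the cost difference concrete, the following is a minimal, hypothetical sketch (not the EvalShortcut implementation) of the two evaluation styles using Hugging Face transformers: an NLU-style reformulation scores each fixed answer choice by its log-likelihood under the model with one forward pass per option, whereas NLG-style evaluation generates the answer token by token. The model name ("gpt2"), the example question, and the answer options are placeholders chosen for illustration only.

```python
# Sketch contrasting NLU-style multiple-choice scoring with NLG-style generation.
# Not the paper's code; model, question, and options are illustrative placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; any causal LM works
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

question = "Q: What is 12 * 7?\nA:"
options = [" 84", " 74", " 96"]  # fixed answer choices (NLU reformulation)

def option_logprob(prompt: str, option: str) -> float:
    """Sum of log-probabilities the model assigns to the option tokens given the prompt."""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    option_ids = tokenizer(option, return_tensors="pt").input_ids
    input_ids = torch.cat([prompt_ids, option_ids], dim=1)
    with torch.no_grad():
        logits = model(input_ids).logits
    # Log-probs for each option token, taken from the position that predicts it.
    log_probs = torch.log_softmax(logits[0, prompt_ids.shape[1] - 1 : -1], dim=-1)
    token_log_probs = log_probs.gather(1, option_ids[0].unsqueeze(1)).squeeze(1)
    return token_log_probs.sum().item()

# NLU-style evaluation: one forward pass per option, pick the highest-scoring choice.
scores = {opt: option_logprob(question, opt) for opt in options}
print("NLU (multiple-choice) pick:", max(scores, key=scores.get))

# NLG-style evaluation: autoregressive generation, one forward pass per generated token.
gen_ids = model.generate(
    tokenizer(question, return_tensors="pt").input_ids,
    max_new_tokens=20,
    do_sample=False,
    pad_token_id=tokenizer.eos_token_id,
)
print("NLG (generated) answer:", tokenizer.decode(gen_ids[0], skip_special_tokens=True))
```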