Large Language Models (LLMs) show promising potential in Software Engineering, especially for code-related tasks like code completion and code generation. The evaluation of LLMs is generally centred around aggregate metrics computed over benchmarks. While such metrics paint a macroscopic view of the benchmarks and of the LLMs' capabilities, it remains unclear how each programming task in these benchmarks assesses the capabilities of the LLMs. In particular, the difficulty level of the tasks in a benchmark is not reflected in the score used to report a model's performance. Yet, a model achieving a 90% score on a benchmark of predominantly easy tasks is likely less capable than a model achieving a 90% score on a benchmark of predominantly difficult tasks. This paper devises a framework, HardEval, for assessing task difficulty for LLMs and crafting new tasks based on the identified hard tasks. The framework uses a diverse array of prompts for a single task, run across multiple LLMs, to obtain a difficulty score for each task of a benchmark. Using two code generation benchmarks, HumanEval+ and ClassEval, we show that HardEval can reliably identify the hard tasks within those benchmarks, highlighting that only 21% of HumanEval+ tasks and 27% of ClassEval tasks are hard for LLMs. Through our analysis of task difficulty, we also characterize 6 practical hard task topics, which we used to generate new hard tasks. Orthogonal to current benchmarking efforts, HardEval can assist researchers and practitioners in building better assessments of LLMs. The difficulty score can be used to identify hard tasks within existing benchmarks, which in turn can be leveraged to generate more hard tasks centred around specific topics, either for the evaluation or the improvement of LLMs. HardEval's general approach can also be applied to other domains such as code completion or question answering (Q/A).
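The abstract states only that a per-task difficulty score is obtained from many prompt variants evaluated across several LLMs, without giving the formula. The sketch below is a minimal illustrative assumption of such an aggregation (one minus the mean pass rate over all model/prompt pairs); the function name, data layout, and formula are hypothetical and not the paper's actual method.

```python
from statistics import mean

def difficulty_score(pass_results: dict[str, dict[str, list[bool]]]) -> float:
    """Hypothetical sketch: pass_results[model][prompt_variant] holds pass/fail
    outcomes for one benchmark task. Returns a score in [0, 1], higher = harder.
    This aggregation (1 - mean pass rate) is an assumption for illustration only."""
    rates = [
        mean(outcomes)                      # pass rate for one (model, prompt) pair
        for prompts in pass_results.values()
        for outcomes in prompts.values()
        if outcomes
    ]
    return 1.0 - mean(rates) if rates else 0.0

# Example: one task evaluated with two models and two prompt variants each.
results = {
    "model_a": {"base_prompt": [True, False, True], "paraphrase": [False, False, True]},
    "model_b": {"base_prompt": [True, True, True], "paraphrase": [True, False, True]},
}
print(f"difficulty = {difficulty_score(results):.2f}")  # 0.33 -> relatively easy task
```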