The use of large language models (LLMs) is widespread across many domains, including Software Engineering, where they have been used to automate tasks such as program generation and test classification. As LLM-based methods continue to evolve, it is important that we define clear and robust methods that fairly evaluate performance. Benchmarks are a common approach to assessing an LLM's ability to solve problem-specific tasks, as well as to comparing different versions of an LLM on the same tasks over time. For example, the HumanEval benchmark is composed of 164 hand-crafted tasks and has become an important tool for assessing LLM-based program generation. However, a major barrier to a fair evaluation of LLMs using benchmarks like HumanEval is data contamination, which occurs when benchmark tasks and solutions leak into the training data set. This barrier is compounded by the black-box nature of LLM training data, which makes it difficult to even determine whether leakage has occurred. To address the data leakage problem, we propose a new benchmark construction method in which a benchmark is composed of template tasks that can be instantiated into new concrete tasks using combinatorial test design. Concrete tasks derived from the same template task must differ enough that data leakage has minimal impact, yet remain similar enough that the tasks are interchangeable with respect to performance evaluation. To assess our benchmark construction method, we propose HumanEval_T, an alternative benchmark to HumanEval that was constructed using template tasks and combinatorial test design.
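The core idea of instantiating a template task into many concrete tasks can be sketched in a few lines. The template text, the parameter names, and their value sets below are illustrative assumptions, not the paper's actual HumanEval_T templates, and the full Cartesian product is used for simplicity; practical combinatorial test design would typically sample a covering array (e.g. pairwise combinations) instead.

```python
import itertools

# Hypothetical template task with three variation points ({func}, {agg},
# {parity}); these names and values are assumptions for illustration only.
TEMPLATE = (
    "def {func}(values):\n"
    '    """Return the {agg} of every {parity} element in values."""\n'
)

# Each dimension lists the values a variation point may take.
DIMENSIONS = {
    "func": ["summarize", "reduce_list"],
    "agg": ["sum", "product"],
    "parity": ["even", "odd"],
}

def instantiate(template, dimensions):
    """Expand one template task into concrete tasks, one per value combination."""
    keys = list(dimensions)
    for combo in itertools.product(*(dimensions[k] for k in keys)):
        yield template.format(**dict(zip(keys, combo)))

tasks = list(instantiate(TEMPLATE, DIMENSIONS))
print(len(tasks))  # 2 * 2 * 2 = 8 concrete tasks from one template
```

Because every concrete task shares the template's structure but differs in its surface details, a model that merely memorized one leaked instantiation gains little advantage on the others, while the tasks remain comparable in difficulty for evaluation purposes.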