Large Language Models (LLMs) are used for many tasks, including those related to coding. An important aspect of utilizing LLMs is assessing their fitness for specific use cases. The common practice is to evaluate LLMs against a set of benchmarks. While benchmarks provide a sound foundation for evaluating and comparing alternatives, they suffer from the well-known weakness of leaking into the training data \cite{Xu2024Benchmarking}. We present a method for creating benchmark variations that generalizes across coding tasks and programming languages, and that may also be applied to in-house code bases. Our approach enables ongoing generation of test data, thus mitigating the training-data leakage issue. We implement one benchmark, called \textit{auto-regression}, for the task of text-to-code generation in Python. Auto-regression is specifically designed to aid in debugging and in tracking changes in model output as part of the LLM regression testing process.