In this paper, we tackle a critical challenge in model evaluation: how to keep code benchmarks useful when models may have already seen them during training. To address this challenge, we introduce a novel solution, a dynamic benchmarking framework. Given a code understanding or reasoning benchmark, our framework dynamically transforms each input program with various semantic-preserving mutations, producing a benchmark that is syntactically new yet semantically identical to the original. We evaluated ten popular language models on our dynamic benchmarks. Our evaluation reveals several interesting or surprising findings: (1) all models perform significantly worse than on the original benchmarks, (2) the rankings of some models shift dramatically, and (3) our dynamic benchmarks resist the data contamination problem.
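To make the mutation idea concrete, below is a minimal sketch of one semantic-preserving transformation, consistent variable renaming, implemented with Python's `ast` module. This is an illustrative example under our own assumptions, not the framework's actual implementation; the class and function names (`RenameVariables`, `mutate`) are hypothetical.

```python
# Sketch: one semantic-preserving mutation -- renaming every local variable
# to an opaque fresh name (v0, v1, ...) so the program looks syntactically
# new while behaving identically. Illustrative only; not the paper's code.
import ast


class RenameVariables(ast.NodeTransformer):
    """Consistently rename variables and function parameters."""

    def __init__(self):
        self.mapping = {}

    def _fresh(self, name):
        # Reuse the same fresh name for every occurrence of `name`.
        if name not in self.mapping:
            self.mapping[name] = f"v{len(self.mapping)}"
        return self.mapping[name]

    def visit_arg(self, node):
        # Rename function parameters so the body stays consistent.
        node.arg = self._fresh(node.arg)
        return node

    def visit_Name(self, node):
        # Rename both reads and writes of a variable.
        node.id = self._fresh(node.id)
        return node


def mutate(source: str) -> str:
    """Return a syntactically different but semantically identical program."""
    tree = RenameVariables().visit(ast.parse(source))
    ast.fix_missing_locations(tree)
    return ast.unparse(tree)  # requires Python 3.9+


if __name__ == "__main__":
    original = "def add(x, y):\n    total = x + y\n    return total\n"
    print(mutate(original))
    # Prints:
    # def add(v0, v1):
    #     v2 = v0 + v1
    #     return v2
```

A real framework in this spirit would combine several such mutations (e.g., statement reordering where data flow permits, or loop restructuring), each guaranteed not to change program semantics, and apply them freshly at evaluation time so a model cannot benefit from having memorized the original benchmark programs.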