Code coverage is a widely used metric for quantifying the extent to which program elements, such as statements or branches, are executed during testing. Computing code coverage is resource-intensive: the code must be built and executed, and instrumentation adds further overhead. Furthermore, computing the coverage of any snippet of code requires the whole program context. Using machine learning to amortize this expensive process could lower the cost of code coverage by requiring only the source code context, and the task of code coverage prediction can serve as a novel benchmark for judging the ability of models to understand code. We propose a novel benchmark task called Code Coverage Prediction for Large Language Models (LLMs). We formalize this task to evaluate the capability of LLMs in understanding code execution by determining which lines of a method are executed by a given test case and inputs. We curate and release a dataset we call COVERAGEEVAL by executing tests and code from the HumanEval dataset and collecting code coverage information. We report the performance of four state-of-the-art LLMs used for code-related tasks, including OpenAI's GPT-4 and GPT-3.5-Turbo, Google's BARD, and Anthropic's Claude, on the Code Coverage Prediction task. Finally, we argue that code coverage, both as a metric and as a pre-training data source, is valuable for overall LLM performance on software engineering tasks.
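To make the task concrete, the following is a minimal sketch of how per-line execution labels for a single method and input could be obtained by actually running the code. It is an illustration only and not the pipeline used to build COVERAGEEVAL: the helper `collect_executed_lines` and the toy method `classify` are hypothetical names introduced here, and the sketch assumes Python's standard `sys.settrace` hook as the tracing mechanism.

```python
import sys


def collect_executed_lines(func, *args, **kwargs):
    """Run func and return (result, executed line offsets).

    Offsets are relative to the function definition: 1 is the `def` line,
    2 is the first line of the body, and so on. Hypothetical helper for
    illustration; not taken from the paper.
    """
    code = func.__code__
    first_line = code.co_firstlineno
    executed = set()

    def tracer(frame, event, arg):
        # Only trace frames that belong to the function under test.
        if frame.f_code is not code:
            return None
        if event == "line":
            executed.add(frame.f_lineno - first_line + 1)
        return tracer

    previous = sys.gettrace()
    sys.settrace(tracer)
    try:
        result = func(*args, **kwargs)
    finally:
        # Restore whatever tracer was active before (e.g. a debugger).
        sys.settrace(previous)
    return result, executed


def classify(n):
    if n < 0:
        return "negative"
    if n == 0:
        return "zero"
    return "positive"
```

Under these assumptions, `collect_executed_lines(classify, -1)` reports offsets 2 and 3 as executed, while an input of `5` reports offsets 2, 4, and 6. A model performing Code Coverage Prediction would be asked to produce such per-line labels from the source code and test input alone, without building or running anything.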