Benchmarking Causal Study to Interpret Large Language Models for Source Code

One of the most common solutions adopted by software researchers to address code generation is to train Large Language Models (LLMs) on massive amounts of source code. Although a number of studies have shown that LLMs can be evaluated effectively with popular accuracy metrics (e.g., BLEU, CodeBLEU), previous research has largely overlooked the role of Causal Inference as a fundamental component of the interpretability of LLMs' performance. Existing benchmarks and datasets are meant to highlight the difference between the expected and the generated outcome, but they do not take into account confounding variables (e.g., lines of code, prompt size) that equally influence the accuracy metrics. As a result, for generative software tasks performed by LLMs, no benchmark is available to tell researchers how to quantify either the causal effect of treatments based on software engineering (SE) or the correlation of confounders with the model's performance. In an effort to bring statistical rigor to the evaluation of LLMs, this paper introduces a benchmarking strategy named Galeras, comprising curated testbeds for three SE tasks (i.e., code completion, code summarization, and commit generation), to aid the interpretation of LLMs' performance. We illustrate the insights of our benchmarking strategy by conducting a case study on the performance of ChatGPT under distinct prompt engineering methods. The results of the case study demonstrate the positive causal influence of prompt semantics on ChatGPT's generative performance, with an average treatment effect of $\approx 3\%$. Moreover, we found that confounders such as prompt size are highly correlated with accuracy metrics ($\approx 0.412\%$). The goal of our case study is to showcase causal inference evaluations, in practice, as a way to reduce confounding bias. By reducing this bias, we offer an interpretable solution for the accuracy metric under analysis.
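
To make the distinction between a raw metric difference and a confounder-adjusted treatment effect concrete, the following is a minimal sketch of a stratified (backdoor) adjustment. It is an illustration under stated assumptions, not necessarily the estimator used by Galeras: the field names `treated`, `score`, and `size_bucket`, the helper functions, and the toy records are all hypothetical, and the numbers are invented for the example rather than taken from the case study.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical toy records (NOT data from the paper). "treated" marks an
# alternative prompt method, "score" is an accuracy metric in [0, 1], and
# "size_bucket" is a discretized prompt size acting as the confounder.
TOY_RECORDS = [
    {"treated": True,  "size_bucket": "short", "score": 0.30},
    {"treated": False, "size_bucket": "short", "score": 0.20},
    {"treated": False, "size_bucket": "short", "score": 0.20},
    {"treated": False, "size_bucket": "short", "score": 0.20},
    {"treated": True,  "size_bucket": "long",  "score": 0.60},
    {"treated": True,  "size_bucket": "long",  "score": 0.60},
    {"treated": True,  "size_bucket": "long",  "score": 0.60},
    {"treated": False, "size_bucket": "long",  "score": 0.50},
]


def split_scores(records):
    """Return (treated scores, control scores) for a list of records."""
    treated = [r["score"] for r in records if r["treated"]]
    control = [r["score"] for r in records if not r["treated"]]
    return treated, control


def naive_difference(records):
    """Treated-minus-control mean score, ignoring the confounder."""
    treated, control = split_scores(records)
    return mean(treated) - mean(control)


def adjusted_ate(records):
    """Stratified estimate: per-bucket gaps weighted by bucket size."""
    strata = defaultdict(list)
    for r in records:
        strata[r["size_bucket"]].append(r)

    estimate = 0.0
    for bucket in strata.values():
        treated, control = split_scores(bucket)
        if not treated or not control:
            # Without both groups in a stratum the gap is undefined.
            raise ValueError("each stratum needs treated and control samples")
        weight = len(bucket) / len(records)
        estimate += weight * (mean(treated) - mean(control))
    return estimate


if __name__ == "__main__":
    print("naive difference:", round(naive_difference(TOY_RECORDS), 3))
    print("adjusted ATE:    ", round(adjusted_ate(TOY_RECORDS), 3))
```

In this toy data the treated prompts are concentrated in the long-prompt bucket, where scores are higher regardless of treatment, so the naive difference overstates the effect relative to the stratified estimate. Stratification only removes bias from confounders that are observed and included in the buckets.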
