Given recent advancements in Large Language Models (LLMs), code generation tasks have attracted immense attention owing to their wide application across domains. To evaluate and select the best model for automatically remediating system incidents discovered by Application Performance Monitoring (APM) platforms, it is crucial to verify that the generated code is syntactically and semantically correct and that it executes as intended. However, current methods for evaluating the quality of LLM-generated code rely heavily on surface-form similarity metrics (e.g., BLEU, ROUGE, and exact/partial match), which have numerous limitations. In contrast, execution-based evaluation focuses on code functionality and does not constrain the generated code to any fixed solution. Nevertheless, designing and implementing such an execution-based evaluation platform is not a trivial task. Several works have created execution-based evaluation platforms for popular programming languages such as SQL, Python, and Java, but there have been limited or no attempts for scripting languages such as Bash and PowerShell. In this paper, we present the first execution-based evaluation platform for these scripting languages, for which we created three test suites (125 handcrafted test cases in total) to evaluate Bash (both single-line commands and multi-line scripts) and PowerShell code generated by LLMs. We benchmark seven closed- and open-source LLMs on our platform using different techniques (zero-shot vs. few-shot learning).
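To make the contrast between surface-form matching and execution-based evaluation concrete, the following is a minimal illustrative sketch and not the platform described in this paper. The helper names (`run_command`, `exact_match`, `execution_match`), the fixture file `data.txt`, and the choice to compare exit code and standard output are all assumptions made for illustration. A real harness would also need to sandbox untrusted generated code.

```python
import subprocess
import tempfile


def run_command(command, workdir, shell="bash", timeout=10.0):
    """Run one command in `workdir`; return (exit code, stripped stdout)."""
    result = subprocess.run(
        [shell, "-c", command],
        cwd=workdir,
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return result.returncode, result.stdout.strip()


def exact_match(candidate, reference):
    """Surface-form check: the two command strings must be identical."""
    return candidate.strip() == reference.strip()


def execution_match(candidate, reference, setup, shell="bash"):
    """Execution-based check (hypothetical criterion): both commands must
    produce the same exit code and stdout when run on the same fixture."""
    outcomes = []
    for command in (candidate, reference):
        # Each command gets a fresh directory so side effects cannot leak.
        with tempfile.TemporaryDirectory() as workdir:
            run_command(setup, workdir, shell)  # build the fixture files
            outcomes.append(run_command(command, workdir, shell))
    return outcomes[0] == outcomes[1]


if __name__ == "__main__":
    # Hypothetical task: "count the lines in data.txt".
    setup = "printf 'a\\nb\\nc\\n' > data.txt"
    reference = "wc -l < data.txt"
    candidate = "grep -c '' data.txt"  # different text, same behavior

    print("exact match:    ", exact_match(candidate, reference))
    print("execution match:", execution_match(candidate, reference, setup))
```

In this sketch the candidate differs textually from the reference, so an exact-match metric rejects it, while the execution-based check accepts it because both commands behave identically on the fixture.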