Recent advances in Large Language Models (LLMs) have drawn considerable attention to code generation, the task of translating natural language prompts into programming languages, because of its wide applicability across domains. Code generation for Bash (NL2Bash) in particular is widely used to produce Bash scripts that automate tasks such as performance monitoring, compilation, system administration, and system diagnostics. Beyond generation, validating synthesized code is critical before it is used in any application. Proposed validation methods are either direct (execution-based evaluation) or indirect (e.g., exact or partial match, BLEU score). Among these, Execution-based Evaluation (EE) validates predicted code by executing it on a system and comparing its output with the expected output. However, designing and implementing such an execution-based evaluation system for NL2Bash is not trivial. In this paper, we present machinery for execution-based evaluation of NL2Bash. We create a set of 50 prompts to evaluate several popular LLMs on NL2Bash. We also analyze several advantages and challenges of EE, such as Bash scripts from different LLMs that are syntactically different yet semantically equivalent, or scripts that are syntactically correct but semantically incorrect, and describe how we capture and process them correctly.
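To make the comparison step of execution-based evaluation concrete, the following is a minimal sketch and not the system presented in this paper. It runs a predicted script and a reference script in separate temporary directories and compares exit codes and whitespace-normalized standard output. The names `run_bash`, `normalize`, and `execution_match` are hypothetical, as are the 5-second timeout and the normalization rule. The sketch also assumes `bash` is on the `PATH`; a real harness would need proper isolation (for example, a container), since model-generated scripts are untrusted.

```python
import subprocess
import tempfile


def run_bash(script: str, timeout: float = 5.0) -> tuple[int, str]:
    """Run a Bash snippet in a fresh temporary directory; return (exit code, stdout).

    Assumption: a temporary working directory is enough for illustration.
    It is NOT a sandbox and does not protect the host from harmful scripts.
    """
    with tempfile.TemporaryDirectory() as workdir:
        proc = subprocess.run(
            ["bash", "-c", script],
            cwd=workdir,
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    return proc.returncode, proc.stdout


def normalize(output: str) -> list[str]:
    """Strip surrounding whitespace per line so padding differences are ignored."""
    return [line.strip() for line in output.strip().splitlines()]


def execution_match(predicted: str, reference: str) -> bool:
    """Return True if both scripts yield the same exit code and normalized stdout."""
    try:
        pred_code, pred_out = run_bash(predicted)
    except subprocess.TimeoutExpired:
        # A prediction that hangs is counted as a failure.
        return False
    ref_code, ref_out = run_bash(reference)
    return pred_code == ref_code and normalize(pred_out) == normalize(ref_out)


if __name__ == "__main__":
    reference = r"printf 'a\nb\nc\n' | wc -l"       # counts lines
    equivalent = r"printf 'a\nb\nc\n' | grep -c ''"  # different syntax, also counts lines
    wrong = r"printf 'a\nb\nc\n' | wc -c"            # valid Bash, but counts bytes
    print(execution_match(equivalent, reference))
    print(execution_match(wrong, reference))
```

The two sample predictions mirror the cases discussed above: the `grep -c ''` variant differs syntactically from the reference but produces the same line count, so an exact-match metric would reject it while an output comparison accepts it; the `wc -c` variant is syntactically valid but counts bytes instead of lines, so its output differs from the reference.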