To automatically evaluate and select the best model, and to improve code quality for automatic incident remediation in IT Automation, it is crucial to verify whether generated remediation code is syntactically and semantically correct and whether it executes as intended. Three evaluation approaches exist: 1) conventional methods use surface-form similarity metrics (token match, exact match, etc.), which have numerous limitations; 2) execution-based evaluation judges code functionality by pass/fail outcomes on given test cases; and 3) LLM-as-a-Judge employs LLMs to automatically judge whether an answer is correct for a given problem according to pre-defined metrics. In this work, we enhance LLM-as-a-Judge with bidirectional functionality matching and logic representation for reference-less automatic validation and refinement of Bash code generation, in order to select the best model for automatic incident remediation in IT Automation. Using execution-based evaluation as ground truth, our LLM-as-a-Judge metrics achieve high accuracy and agreement with execution-based results (up to 8% over the baseline). Finally, we built Reflection code agents that use the judgments and feedback from our evaluation metrics for automatic code refinement, achieving a significant improvement of up to 24% in accuracy.
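To illustrate why the abstract contrasts these evaluation families, here is a minimal sketch (not from the paper; the file path and test command are illustrative) showing how a surface-form metric can reject a Bash snippet that execution-based evaluation accepts, because two syntactically different commands can be functionally equivalent:

```python
import subprocess

def surface_match(candidate: str, reference: str) -> bool:
    # Exact-match metric: brittle, penalizes semantically equivalent code
    return candidate.strip() == reference.strip()

def execution_pass(candidate: str, test_cmd: str) -> bool:
    # Execution-based evaluation: run the generated snippet, then a test
    # command that checks its effect; pass/fail is the test's exit status
    script = f"{candidate}\n{test_cmd}"
    result = subprocess.run(["bash", "-c", script],
                            capture_output=True, text=True, timeout=10)
    return result.returncode == 0

# Two semantically equivalent ways to create an empty file
reference = "touch /tmp/demo_file"
candidate = ": > /tmp/demo_file"
test_cmd = "test -f /tmp/demo_file"

print(surface_match(candidate, reference))   # False: surface forms differ
print(execution_pass(candidate, test_cmd))   # True: the intended effect holds
```

This gap is what motivates execution-based pass/fail judgments as ground truth, and reference-less LLM-as-a-Judge metrics for settings where test cases or references are unavailable.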