Evaluating the correctness of code generated by AI is a challenging open problem. In this paper, we propose a fully automated method, named ACCA, to evaluate the correctness of AI-generated code for security purposes. The method uses symbolic execution to assess whether the AI-generated code behaves like a reference implementation. We use ACCA to assess four state-of-the-art models trained to generate security-oriented assembly code, and we compare the evaluation results with several baseline solutions, including output similarity metrics, which are widely used in the field, and ChatGPT, the well-known AI-powered language model developed by OpenAI. Our experiments show that our method outperforms the baseline solutions and assesses the correctness of the AI-generated code similarly to human evaluation, which is considered the ground truth for this assessment in the field. Moreover, ACCA correlates very strongly with human evaluation (Pearson's correlation coefficient r=0.84 on average). Finally, since it is a fully automated solution that requires no human intervention, the proposed method assesses each code snippet in ~0.17 s on average, far less than the average time human analysts need to manually inspect the code, based on our experience.
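To illustrate the core idea behind this kind of correctness assessment — checking that a candidate snippet behaves like a reference implementation rather than comparing their text — here is a minimal sketch in Python. Note that this toy version uses randomized differential testing over sampled inputs, whereas ACCA itself relies on symbolic execution, which reasons about program paths rather than sampling; all function names here are hypothetical and chosen only for illustration.

```python
import random

def reference(x, y):
    # Hypothetical reference implementation: saturating 8-bit addition.
    return min(x + y, 255)

def candidate(x, y):
    # Hypothetical AI-generated candidate whose behavior we want to validate.
    return x + y if x + y <= 255 else 255

def behaves_like_reference(cand, ref, trials=1000, seed=0):
    """Toy behavioral check: run both implementations on random inputs
    and report whether their outputs ever diverge. A symbolic-execution
    approach would instead explore input classes exhaustively."""
    rng = random.Random(seed)
    for _ in range(trials):
        x, y = rng.randrange(256), rng.randrange(256)
        if cand(x, y) != ref(x, y):
            return False  # behavioral mismatch found
    return True

print(behaves_like_reference(candidate, reference))
```

Note how a textually different candidate still passes, because the check compares behavior, not syntax — exactly the property that output similarity metrics miss.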