In this paper, we propose ACCA, a fully automated method for evaluating the correctness of AI-generated code for security purposes. The method uses symbolic execution to assess whether the AI-generated code behaves in the same way as a reference implementation. We use ACCA to assess four state-of-the-art models trained to generate security-oriented assembly code, and we compare the results with several baseline solutions, including the output similarity metrics widely used in the field and ChatGPT, the AI-powered language model developed by OpenAI. Our experiments show that our method outperforms the baseline solutions and assesses the correctness of AI-generated code similarly to human evaluation, which is considered the ground truth for assessment in the field. In particular, ACCA correlates very strongly with human evaluation (Pearson's correlation coefficient r = 0.84 on average). Finally, because it is fully automated and requires no human intervention, the proposed method assesses each code snippet in about 0.17 s on average, which in our experience is far lower than the average time human analysts need to inspect the code manually.
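As an illustration of how agreement with human evaluation can be quantified, the following minimal sketch computes Pearson's correlation coefficient between two lists of scores. The function name `pearson_r` and the sample scores are hypothetical and are not taken from the paper; the sketch shows only the standard formula, not ACCA's evaluation pipeline.

```python
import math


def pearson_r(xs, ys):
    """Return Pearson's correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length with at least 2 items")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (unnormalized covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (unnormalized variances).
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    # Hypothetical correctness scores (illustrative values only, not results from the paper).
    automated_scores = [0.62, 0.71, 0.55, 0.80]
    human_scores = [0.65, 0.75, 0.50, 0.78]
    print(f"r = {pearson_r(automated_scores, human_scores):.2f}")
```

A value of r close to 1 indicates that the automated scores rise and fall together with the human scores; it does not by itself show that the two agree on absolute values.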