AI-based code generators are an emerging solution for automatically writing programs from natural-language descriptions using deep neural networks (Neural Machine Translation, NMT). In particular, code generators have been used for ethical hacking and offensive security testing by generating proof-of-concept attacks. Unfortunately, the evaluation of code generators still faces several issues. Current practice relies on output similarity metrics, i.e., automatic metrics that compute the textual similarity between generated code and ground-truth references. However, it is unclear which metric to use and which is most suitable for a specific context. This work analyzes a large set of output similarity metrics on offensive code generators. We apply the metrics to two state-of-the-art NMT models using two datasets containing offensive assembly and Python code with descriptions in English. We compare the estimates from the automatic metrics with human evaluation and provide practical insights into their strengths and limitations.
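To make the notion of an output similarity metric concrete, the following is a minimal sketch of two simple metrics of this kind: whitespace-insensitive exact match and a normalized character-level edit distance. The function names and the choice of these two metrics are assumptions made for illustration; the abstract does not state which metrics the study covers or how they are implemented.

```python
def exact_match(generated: str, reference: str) -> bool:
    """True if the two snippets are identical after whitespace normalization."""
    return generated.split() == reference.split()


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings (character level)."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def edit_similarity(generated: str, reference: str) -> float:
    """Similarity in [0, 1]; 1.0 means the strings are identical."""
    longest = max(len(generated), len(reference))
    if longest == 0:
        return 1.0  # two empty snippets are treated as identical
    return 1.0 - edit_distance(generated, reference) / longest


if __name__ == "__main__":
    # Hypothetical generated snippet and ground-truth reference.
    generated = "xor eax, eax"
    reference = "xor ebx, ebx"
    print(exact_match(generated, reference))
    print(round(edit_similarity(generated, reference), 3))
```

Both functions score text only: they compare the characters of the generated snippet against the reference and do not execute or analyze the code, which is why comparing such scores against human evaluation is informative.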