In this work, we present a method that perturbs code descriptions to produce new natural language (NL) inputs of the kind a well-intentioned developer might write: descriptions that diverge from the originals because they use different words or omit part of the original content. Our goal is to analyze how, and to what extent, such perturbations affect the performance of AI code generators in the context of security-oriented code. First, we show that the perturbed descriptions preserve the semantics of the original, non-perturbed ones. Then, we use the method to assess the robustness of three state-of-the-art code generators against the perturbed inputs, showing that the performance of these AI-based solutions is strongly affected by perturbations in the NL descriptions. To enhance their robustness, we use the method to perform data augmentation, i.e., to increase the variability and diversity of the NL descriptions in the training data, and demonstrate its effectiveness on both perturbed and non-perturbed code descriptions.
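As an illustration of the two kinds of perturbation mentioned above (new words and missing words) and of their use for data augmentation, the following is a minimal sketch, not the authors' implementation. The synonym table, the set of omissible words, and the function names `substitute_words`, `omit_words`, and `augment` are all hypothetical and chosen only for this example; the abstract does not specify how substitutions or omissions are selected.

```python
import re

# Hypothetical synonym table (assumption: not taken from the source).
SYNONYMS = {"store": "save", "compute": "calculate"}

# Hypothetical set of words a developer might leave out (assumption).
OMISSIBLE = {"the", "variable"}


def substitute_words(description, synonyms=SYNONYMS):
    """Word substitution: replace each known word with its synonym."""
    def swap(match):
        word = match.group(0)
        # Lookup is case-insensitive; the replacement is inserted as listed.
        return synonyms.get(word.lower(), word)
    return re.sub(r"[A-Za-z]+", swap, description)


def omit_words(description, omissible=OMISSIBLE):
    """Word omission: drop every word that appears in the omissible set."""
    kept = [w for w in description.split() if w.lower() not in omissible]
    return " ".join(kept)


def augment(pairs):
    """Data augmentation: keep the original (description, code) pairs and
    add perturbed descriptions that point to the same, unchanged code."""
    augmented = list(pairs)
    for description, code in pairs:
        for perturb in (substitute_words, omit_words):
            variant = perturb(description)
            if variant != description:  # skip no-op perturbations
                augmented.append((variant, code))
    return augmented


if __name__ == "__main__":
    data = [("store the result in the variable total", "total = result")]
    for description, code in augment(data):
        print(f"{description!r} -> {code!r}")
```

In this sketch the code side of each pair is never modified, which reflects the idea that a perturbed description should still map to the same target code. Whether a given perturbation actually preserves the semantics of the description would have to be checked separately, as the abstract notes.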