Vulnerability codebases often suffer from severe class imbalance, which limits the effectiveness of deep-learning-based vulnerability classifiers. Data augmentation can help by mitigating the scarcity of under-represented vulnerability types. In this context, we investigate LLM-based augmentation for vulnerable functions, comparing controlled generation of new vulnerable samples with semantics-preserving refactoring of existing ones. Using Qwen2.5-Coder to produce augmented data and CodeBERT as a classifier on the SVEN dataset, we find that both approaches effectively enrich vulnerable codebases through a simple process and with reasonable quality, and that a hybrid strategy yields the largest gains in classifier performance. The code repository is available at https://github.com/DynaSoumhaneOuchebara/LLM-based-code-augmentation-Generate-or-Refactor-
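The two augmentation strategies compared above can be sketched as prompt templates sent to a code LLM. This is an illustrative sketch only, not the authors' exact pipeline: the prompt wording, function names, and the assumption of an instruction-following Qwen2.5-Coder endpoint are all hypothetical.

```python
def generation_prompt(cwe_id: str, language: str = "C") -> str:
    """Controlled generation: ask the LLM to synthesize a NEW vulnerable
    function of a given CWE type (hypothetical prompt wording)."""
    return (
        f"Write a standalone {language} function that contains a "
        f"{cwe_id} vulnerability. Return only the code."
    )

def refactoring_prompt(code: str) -> str:
    """Semantics-preserving refactoring: rewrite an EXISTING vulnerable
    function (e.g., rename identifiers, restructure control flow) without
    changing its behavior, so the vulnerability and its label are kept."""
    return (
        "Refactor the following function while preserving its exact "
        "semantics, including its vulnerability. Return only the code.\n\n"
        + code
    )

# Example usage on a classic buffer-overflow pattern:
sample = "void copy(char *dst, char *src) { strcpy(dst, src); }"
print(generation_prompt("CWE-787"))
print(refactoring_prompt(sample))
```

A hybrid strategy, as the abstract notes, would mix outputs of both prompt families into the minority-class training pool before fine-tuning the classifier.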