Software, while beneficial, poses cybersecurity risks due to inherent vulnerabilities. Detecting these vulnerabilities is crucial, and deep learning has shown promise as an effective tool for this task because it performs well without extensive feature engineering. However, a key challenge in deploying deep learning for vulnerability detection is the limited availability of training data. Recent research highlights the efficacy of deep learning across diverse tasks, a success attributed in part to instruction fine-tuning, a technique that remains under-explored in the context of vulnerability detection. This paper investigates whether models, specifically a recent language model, can generalize beyond the programming languages in their training data, and examines the role of natural language instructions in enhancing this generalization. We evaluate the model's performance on a real-world dataset for predicting vulnerable code, and we present key insights and lessons learned, contributing to the understanding of deep learning applications in software vulnerability detection.