Context: Traditional software security analysis methods struggle to keep pace with the scale and complexity of modern codebases, calling for intelligent automation that detects, assesses, and remediates vulnerabilities more efficiently and accurately. Objective: As a proof of concept, this paper explores the use of code-specific and general-purpose Large Language Models (LLMs) to automate critical software security tasks: identifying vulnerabilities, predicting severity and access complexity, and generating fixes. Method: We evaluate five pairs of recent LLMs, spanning both code-based and general-purpose open-source models, on two recognized C/C++ vulnerability datasets, Big-Vul and Vul-Repair, and we compare fine-tuning against prompt-based (zero-shot and few-shot) approaches. Results: Fine-tuning consistently outperforms both zero-shot and few-shot prompting across all tasks and models. Notably, code-specialized models excel in zero-shot and few-shot settings on complex tasks, while general-purpose models remain nearly as effective. Discrepancies among CodeBLEU, CodeBERTScore, BLEU, and ChrF highlight the inadequacy of current metrics for measuring repair quality. Conclusions: This study contributes to the software security community by investigating the potential of advanced LLMs to improve vulnerability analysis and remediation.
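The metric discrepancy noted in the results can be illustrated with a toy sketch (not the paper's evaluation code): a simplified single-order character n-gram F-score standing in for ChrF, and a BLEU-1-style unigram precision standing in for BLEU. Both metrics penalize a semantically equivalent repair, and by different amounts, which is one way rankings of candidate fixes can disagree across surface-level metrics. The example patch strings below are hypothetical.

```python
from collections import Counter

def char_ngrams(s, n):
    """Character n-grams with whitespace removed (ChrF-style)."""
    s = "".join(s.split())
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))

def chrf1(candidate, reference, n=3):
    """Simplified single-order character n-gram F1 (toy stand-in for ChrF)."""
    c, r = char_ngrams(candidate, n), char_ngrams(reference, n)
    if not c or not r:
        return 0.0
    overlap = sum((c & r).values())
    prec = overlap / sum(c.values())
    rec = overlap / sum(r.values())
    return 0.0 if prec + rec == 0 else 2 * prec * rec / (prec + rec)

def token_precision(candidate, reference):
    """BLEU-1-style unigram precision over whitespace tokens."""
    c, r = Counter(candidate.split()), Counter(reference.split())
    overlap = sum((c & r).values())
    return overlap / sum(c.values()) if c else 0.0

# A ground-truth patch and a semantically equivalent candidate fix:
# in C, writing '\0' instead of 0 terminates the buffer identically.
reference = "if (len < MAX) { buf[len] = 0; }"
candidate = "if (len < MAX) { buf[len] = '\\0'; }"

# Neither metric awards a perfect score to the equivalent fix, and the
# two scores differ, so cross-metric comparisons need care.
print(round(chrf1(candidate, reference), 3))
print(round(token_precision(candidate, reference), 3))
```

Real evaluations should use the reference implementations of these metrics (e.g. sacreBLEU for BLEU and ChrF); this sketch only shows why purely surface-level overlap can understate repair quality.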