Background: Software Vulnerability (SV) prediction in emerging languages is increasingly important for ensuring the security of modern systems. However, these languages usually have limited SV data for developing high-performing prediction models. Aims: We conduct an empirical study to evaluate the impact of SV data scarcity in emerging languages on the state-of-the-art SV prediction model and to investigate potential solutions for improving its performance. Method: We train and test the state-of-the-art model based on CodeBERT, with and without data sampling techniques, for function-level and line-level SV prediction in three low-resource languages: Kotlin, Swift, and Rust. We also assess the effectiveness of ChatGPT for low-resource SV prediction, given its recent success in other domains. Results: Compared to the original work in C/C++ with large data, CodeBERT's performance on function-level and line-level SV prediction declines significantly in low-resource languages, signifying the negative impact of data scarcity. Regarding remediation, data sampling techniques fail to improve CodeBERT, whereas ChatGPT shows promising results, enhancing predictive performance by up to 34.4% at the function level and up to 53.5% at the line level. Conclusion: We highlight the challenge of low-resource SV prediction and take a promising first step toward addressing it, paving the way for future research in this direction.