Security bug reports require prompt identification to minimize the window of vulnerability in software systems. Traditional machine learning (ML) techniques for classifying bug reports as security-related rely heavily on large amounts of labeled data. In practice, however, labeled security bug reports are scarce, leading to poor model performance and limited real-world applicability. In this study, we propose a few-shot learning-based technique that identifies security bug reports effectively using limited labeled data. We employ SetFit, a state-of-the-art few-shot learning framework that combines sentence transformers with contrastive learning and parameter-efficient fine-tuning. The model is trained on a small labeled set of bug reports and evaluated on its ability to classify reports as either security-related or non-security-related. Our approach achieves an AUC of up to 0.865, outperforming traditional ML baselines on all evaluated datasets. These results highlight SetFit-based few-shot learning as a promising alternative to traditional ML techniques for identifying security bug reports: it enables efficient model development with minimal annotation effort, making it well suited to scenarios where labeled data is scarce.
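SetFit's first stage fine-tunes a sentence transformer with contrastive learning on sentence pairs generated from the few labeled examples: pairs sharing a label act as positives, pairs with differing labels as negatives. The following is a minimal sketch of that pair-generation step only, using the Python standard library and hypothetical example data (not the study's datasets or the SetFit library's internal API):

```python
from itertools import combinations

def contrastive_pairs(examples):
    """Build (text_a, text_b, similarity) pairs from labeled examples,
    mirroring SetFit's contrastive fine-tuning stage: same-label pairs
    are positives (1.0), different-label pairs are negatives (0.0)."""
    pairs = []
    for (text_a, label_a), (text_b, label_b) in combinations(examples, 2):
        pairs.append((text_a, text_b, 1.0 if label_a == label_b else 0.0))
    return pairs

# Tiny illustrative labeled set: 1 = security-related, 0 = not.
reports = [
    ("Buffer overflow in parser allows remote code execution", 1),
    ("SQL injection possible via the login form", 1),
    ("Button misaligned on the settings page", 0),
    ("Typo in the About dialog text", 0),
]
pairs = contrastive_pairs(reports)
# 4 examples yield C(4,2) = 6 pairs: 2 positives and 4 negatives.
```

In the full framework, these pairs train the sentence-transformer body, after which a lightweight classification head is fit on the resulting embeddings; this pair expansion is what lets SetFit extract a strong training signal from only a handful of labeled bug reports.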