In the face of growing vulnerabilities found in open-source software, identifying discreet security patches has become paramount. Software providers handle maintenance inconsistently, and security patches are often released without comprehensive advisories, leaving users exposed to unaddressed security risks. To address this issue, we introduce LLMDA, a novel security patch detection system that leverages Large Language Models (LLMs) and code-text alignment for patch review, data augmentation, and feature combination. Within LLMDA, we first use LLMs to examine patches and to expand the data of PatchDB and SPI-DB, two security patch datasets from recent literature. We then use labeled instructions to direct LLMDA to differentiate patches by security relevance. Next, we apply PTFormer to merge patches with code, forming hybrid features that capture both the intrinsic details of the patches and the code and the interconnections between them. This combination lets the system draw more information from the joint context of patches and code, improving detection precision. Finally, we devise a probabilistic batch contrastive learning mechanism to strengthen LLMDA's ability to discern security patches. The results show that LLMDA significantly surpasses state-of-the-art techniques in detecting security patches, underscoring its promise for strengthening software maintenance.
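The abstract does not specify the form of the batch contrastive objective, so the following is only a minimal sketch of a generic in-batch supervised contrastive loss, not LLMDA's probabilistic formulation. It assumes each patch has already been encoded as an embedding vector with a binary security label. The function names, the cosine similarity, and the `temperature` parameter are illustrative assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # Small epsilon avoids division by zero for all-zero vectors.
    return dot / (norm_u * norm_v + 1e-12)


def batch_contrastive_loss(embeddings, labels, temperature=0.1):
    """Generic supervised contrastive loss over one batch (hypothetical helper).

    For each anchor, samples with the same label are positives and all
    other samples in the batch act as negatives. Anchors without any
    positive in the batch are skipped.
    """
    n = len(embeddings)
    total, anchors = 0.0, 0
    for i in range(n):
        positives = [j for j in range(n) if j != i and labels[j] == labels[i]]
        if not positives:
            continue
        # Temperature-scaled similarities to every other sample in the batch.
        logits = {
            j: cosine(embeddings[i], embeddings[j]) / temperature
            for j in range(n)
            if j != i
        }
        # Numerically stable log-sum-exp for the softmax denominator.
        peak = max(logits.values())
        log_denom = peak + math.log(sum(math.exp(v - peak) for v in logits.values()))
        # Average negative log-probability of picking a positive.
        total += -sum(logits[j] - log_denom for j in positives) / len(positives)
        anchors += 1
    return total / anchors if anchors else 0.0


# Toy batch: two "security" patches (label 1) and two "non-security" ones (label 0).
toy_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
toy_labels = [1, 1, 0, 0]
toy_loss = batch_contrastive_loss(toy_embeddings, toy_labels)
```

In this toy batch the embeddings of same-label patches already point in similar directions, so the loss is close to zero. Shuffling the labels so that they disagree with the embedding geometry makes it much larger, which is the signal a contrastive objective of this kind uses to pull same-class patches together.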