Python has become the most popular programming language, as it is friendly to beginners. However, a recent study has found that most security issues in Python are not indexed by CVE and may be fixed only by 'silent' security commits, which threatens software security and hinders the propagation of security fixes to downstream software. Identifying these hidden security commits is critical; however, existing datasets and methods are insufficient for security commit detection in Python because of limited data variety, non-comprehensive code semantics, and uninterpretable learned features. In this paper, we construct the first security commit dataset in Python, named PySecDB, which consists of three subsets: a base dataset, a pilot dataset, and an augmented dataset. The base dataset contains the security commits associated with CVE records provided by MITRE. To increase the variety of security commits, we build the pilot dataset from GitHub by filtering commits on keywords in their commit messages. Since not all commits provide commit messages, we further construct the augmented dataset by analyzing the semantics of code changes. To build the augmented dataset, we propose a new graph representation named CommitCPG and a multi-attributed graph learning model named SCOPY, which together identify security commit candidates through both sequential and structural code semantics. The evaluation shows that our proposed algorithms improve data collection efficiency by up to 40 percentage points. After manual verification by three security experts, PySecDB consists of 1,258 security commits and 2,791 non-security commits. Furthermore, we conduct an extensive case study on PySecDB and discover four common security fix patterns that cover over 85% of security commits in Python, providing insight into secure software maintenance, vulnerability detection, and automated program repair.
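As a rough illustration of the keyword-filtering step behind the pilot dataset, the following minimal sketch selects commits whose messages mention security-related terms. The keyword list, the commit dictionary layout, and the function names `is_candidate` and `filter_candidates` are assumptions made for this example; they are not the keywords or tooling used to build PySecDB.

```python
import re

# Hypothetical keyword list for illustration only; not the list used for PySecDB.
SECURITY_KEYWORDS = ("security", "vulnerab", "cve-", "xss", "injection", "overflow")

# One case-insensitive pattern that matches any of the keywords as a substring.
_PATTERN = re.compile(
    "|".join(re.escape(keyword) for keyword in SECURITY_KEYWORDS),
    re.IGNORECASE,
)


def is_candidate(message):
    """Return True if the commit message contains any security keyword."""
    return bool(message) and _PATTERN.search(message) is not None


def filter_candidates(commits):
    """Keep commits (dicts with 'sha' and 'message' keys) whose message matches."""
    return [c for c in commits if is_candidate(c.get("message", ""))]


# Toy commit records; the 'sha' values are placeholders.
commits = [
    {"sha": "a1", "message": "Fix XSS in template rendering"},
    {"sha": "b2", "message": "Update README"},
    {"sha": "c3", "message": ""},
]
candidates = filter_candidates(commits)
print([c["sha"] for c in candidates])
```

A commit with an empty message, such as `c3` here, can never be selected by this kind of filter, which is the gap the augmented dataset is meant to close by examining the code changes themselves.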