Software vulnerabilities are an ever-growing problem in software development. In most cases, it is desirable to detect vulnerabilities as early as possible, preferably just in time, when the vulnerable code is added to the code base. The industry struggles to combat this problem: manual inspection is costly, and traditional approaches, such as rule-based bug detection, are not robust enough to keep pace with the emergence of new vulnerabilities. Machine learning, an actively researched field, could help in such situations, as models can be trained to detect vulnerable patterns. However, machine learning models work well only if the data is appropriately represented. In this work, we propose a novel way of representing changes in source code (i.e., code commits): the Code Change Tree, a form designed to keep only the differences between two abstract syntax trees of Java source code. We compared its effectiveness in predicting whether a code change introduces a vulnerability against multiple other representation types, using a number of machine learning models as baselines. The evaluation was carried out on a novel dataset, built with a two-phase dataset generation method, that we published as part of our contributions. Based on this evaluation, we conclude that the Code Change Tree is a valid and effective choice for representing source code changes, as it improves prediction performance.
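To make the idea of keeping only the differences between two abstract syntax trees more concrete, the following is a minimal sketch and not the construction used in this work. It assumes a toy `Node` type and a hypothetical `change_tree` function, matches children purely by position, and marks added and removed subtrees with `+` and `-` prefixes; all of these names and choices are illustrative assumptions.

```python
from __future__ import annotations

from dataclasses import dataclass
from itertools import zip_longest
from typing import Optional


@dataclass(frozen=True)
class Node:
    """Toy AST node: a label plus ordered children (hypothetical type)."""
    label: str
    children: tuple["Node", ...] = ()


def change_tree(old: Optional[Node], new: Optional[Node]) -> Optional[Node]:
    """Return a pruned tree holding only the parts where `old` and `new` differ.

    Unchanged subtrees are dropped; ancestors of a change are kept as context.
    Children are matched by position, which is a simplifying assumption.
    """
    if old == new:
        return None  # identical subtrees (or both missing): nothing to keep
    if old is None:
        return Node(f"+{new.label}", new.children)  # only in the new version
    if new is None:
        return Node(f"-{old.label}", old.children)  # only in the old version
    if old.label != new.label:
        # Different node kinds: record the old side as removed, the new as added.
        return Node("~", (Node(f"-{old.label}", old.children),
                          Node(f"+{new.label}", new.children)))
    # Same label: recurse and keep only the children that contain a change.
    kept = tuple(
        diff
        for o, n in zip_longest(old.children, new.children)
        if (diff := change_tree(o, n)) is not None
    )
    return Node(old.label, kept)


# Hypothetical before/after trees for a commit that adds one call argument.
old = Node("Method", (Node("Call:query", (Node("Arg:sql"),)), Node("Return")))
new = Node("Method", (Node("Call:query", (Node("Arg:sql"), Node("Arg:userInput"))),
                      Node("Return")))
result = change_tree(old, new)
```

In this toy example, `result` retains only the path `Method` → `Call:query` → `+Arg:userInput`; the unchanged `Arg:sql` and `Return` nodes are pruned. A realistic implementation would need a proper tree-matching step rather than positional alignment.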