This paper presents an automated method for classifying source code changes made during software development, based on clustering of change metrics. The method consists of two steps: the metric vectors computed for each code change are clustered automatically, and an expert then maps the resulting clusters to predefined change classes. Automating the clustering step substantially reduces the time required for code change review. Clustering uses the k-means algorithm with cosine similarity between metric vectors. Eleven source code metrics are employed, covering lines of code, cyclomatic complexity, file counts, interface changes, and structural changes. The method was validated on five software systems, including two open-source projects (Subversion and NHibernate), and achieved a classification purity of P_C = 0.75 ± 0.05 and an entropy of E_C = 0.37 ± 0.06 at a significance level of 0.05.
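The clustering and evaluation steps can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`cosine_kmeans`, `purity_and_entropy`), the random centroid initialization, and the fixed iteration cap are assumptions made for illustration. The purity and entropy formulas follow the common cluster-evaluation definitions (entropy normalized by the logarithm of the number of classes), which may differ from the exact definitions used in the paper.

```python
import math
import random
from collections import Counter


def normalize(v):
    """Scale a vector to unit length (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def cosine(u, v):
    """Cosine similarity of two unit-length vectors, i.e. their dot product."""
    return sum(a * b for a, b in zip(u, v))


def cosine_kmeans(vectors, k, iterations=50, seed=0):
    """Cluster metric vectors with k-means under cosine similarity.

    Assumption: centroids start as k randomly sampled points; the
    abstract does not say how the paper initializes them.
    """
    rng = random.Random(seed)
    points = [normalize(v) for v in vectors]
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(iterations):
        # Assignment step: each point joins its most similar centroid.
        new_labels = [
            max(range(k), key=lambda c: cosine(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the algorithm has converged
        labels = new_labels
        # Update step: each centroid becomes the re-normalized mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                mean = [sum(col) / len(members) for col in zip(*members)]
                centroids[c] = normalize(mean)
    return labels


def purity_and_entropy(labels, classes):
    """Return (purity, entropy) of a clustering against expert-assigned classes.

    Purity is the fraction of changes that belong to the majority class
    of their cluster. Entropy is the size-weighted mean of per-cluster
    class entropies, normalized to [0, 1] by log(number of classes).
    """
    n = len(labels)
    q = len(set(classes))
    purity = 0.0
    entropy = 0.0
    for c in set(labels):
        counts = Counter(cls for lab, cls in zip(labels, classes) if lab == c)
        size = sum(counts.values())
        purity += max(counts.values()) / n
        if q > 1:
            h = -sum((m / size) * math.log(m / size) for m in counts.values())
            entropy += (size / n) * (h / math.log(q))
    return purity, entropy
```

In this sketch each input vector would hold the eleven metric values of one code change, and `classes` would hold the change classes an expert assigned to a labeled validation sample. Higher purity and lower entropy indicate clusters that align more closely with the expert's classes.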