Commit messages are natural-language descriptions of code changes and are important for software evolution tasks such as code understanding and maintenance. However, previous methods are trained on the entire dataset without considering that only a portion of commit messages adhere to good practice (i.e., good-practice commits), while the rest do not. In our empirical study, we find that training on good-practice commits contributes significantly to commit message generation. Motivated by this finding, we propose a novel knowledge-aware denoising learning method called KADEL. Because good-practice commits constitute only a small proportion of the dataset, we align the remaining training samples with these good-practice commits. To achieve this, we propose a knowledge model that learns commit knowledge by training on good-practice commits. This knowledge model supplements training samples that do not conform to good practice with additional information. However, since the supplementary information may contain noise or prediction errors, we propose a dynamic denoising training method. This method combines a distribution-aware confidence function with a dynamic distribution list, which enhances the effectiveness of the training process. Experimental results on the whole MCMD dataset demonstrate that our method overall achieves state-of-the-art performance compared with previous methods. Our source code and data are available at https://github.com/DeepSoftwareAnalytics/KADEL