Commit message generation (CMG) is a challenging task in automated software engineering that aims to generate natural language descriptions of the code changes in a commit. Previous methods all start from the modified code snippets and output commit messages through template-based, retrieval-based, or learning-based models. While these methods can summarize what was modified from the perspective of the code, they struggle to explain why the commit was made. The correlation between commits and issues, which could be a critical factor in generating rational commit messages, remains unexplored. In this work, we investigate this correlation from the perspectives of dataset and methodology. We construct the first dataset built on pairing correlated commits and issues. The dataset consists of an unlabeled commit-issue parallel part and a labeled part in which each example is provided with human-annotated rational information in the issue. Furthermore, we propose \tool (\underline{Ex}traction, \underline{Gro}unding, \underline{Fi}ne-tuning), a novel paradigm that introduces the correlation between commits and issues into the training phase of models. To evaluate its effectiveness, we perform comprehensive experiments with various state-of-the-art CMG models. The results show that \tool-enhanced models perform significantly better than the original models.