Commit message generation (CMG) is a challenging task in automated software engineering that aims to generate natural language descriptions of the code changes in a commit. Previous methods all start from the modified code snippets and produce commit messages with template-based, retrieval-based, or learning-based models. While these methods can summarize what was modified from the perspective of the code, they struggle to explain why the commit was made. The correlation between commits and issues, which could be a critical factor in generating rational commit messages, remains unexplored. In this work, we delve into the correlation between commits and issues from the perspectives of dataset and methodology. We construct the first dataset built on pairing correlated commits and issues. The dataset consists of an unlabeled commit-issue parallel part and a labeled part in which each example comes with human-annotated rational information from the issue. Furthermore, we propose \tool (\underline{Ex}traction, \underline{Gro}unding, \underline{Fi}ne-tuning), a novel paradigm that introduces the correlation between commits and issues into the training phase of models. To evaluate its effectiveness, we perform comprehensive experiments with various state-of-the-art CMG models. The results show that \tool-enhanced models significantly outperform the original models.