Commit message generation (CMG) is a crucial task in software engineering that is challenging to evaluate correctly. When a CMG system is integrated into IDEs and other products at JetBrains, we perform online evaluation based on user acceptance of the generated messages. However, performing online experiments with every change to a CMG system is troublesome, as each iteration affects users and requires time to collect enough statistics. On the other hand, offline evaluation, the prevalent approach in the research literature, facilitates fast experiments but employs automatic metrics that are not guaranteed to represent the preferences of real users. In this work, we describe a novel approach we employed to deal with this problem at JetBrains: leveraging an online metric, the number of edits users introduce before committing the generated messages to the VCS, to select metrics for offline experiments. To support this new type of evaluation, we develop a novel markup collection tool mimicking the real workflow with a CMG system, collect a dataset of 57 pairs consisting of commit messages generated by GPT-4 and their counterparts edited by human experts, and design and verify a way to synthetically extend such a dataset. Then, we use the final dataset of 656 pairs to study how widely used similarity metrics correlate with the online metric reflecting real users' experience. Our results indicate that edit distance exhibits the highest correlation, whereas commonly used similarity metrics such as BLEU and METEOR demonstrate low correlation. This contradicts previous studies of similarity metrics for CMG, suggesting that user interactions with a CMG system in real-world settings differ significantly from the responses of human labelers operating within controlled research environments. We release all the code and the dataset for researchers: https://jb.gg/cmg-evaluation.
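The online signal described above, how much a user edits a generated message before committing it, can be approximated offline as the edit distance between the generated message and its human-edited counterpart. A minimal sketch of this idea using character-level Levenshtein distance (the exact distance variant and normalization used in the study may differ):

```python
def edit_distance(a: str, b: str) -> int:
    """Classic Levenshtein DP: the minimum number of character
    insertions, deletions, and substitutions turning a into b."""
    m, n = len(a), len(b)
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # substitution
        prev = curr
    return prev[n]

def edit_similarity(generated: str, edited: str) -> float:
    """Normalize to [0, 1]: 1.0 means the message was committed
    unchanged; lower values mean heavier user editing."""
    longest = max(len(generated), len(edited), 1)
    return 1.0 - edit_distance(generated, edited) / longest

# Hypothetical pair: a generated message and its edited counterpart.
generated = "Fix bug in parser"
edited = "Fix null-pointer bug in the config parser"
print(f"similarity: {edit_similarity(generated, edited):.2f}")
```

A metric evaluated offline can then be scored by how strongly it correlates (e.g., via a rank correlation) with this similarity across a dataset of generated/edited pairs.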