When a Commit Message Generation (CMG) system is integrated into the IDEs and other products at JetBrains, we evaluate it online, based on whether users accept the generated messages. However, running an online experiment for every change to a CMG system is costly: each iteration affects users and requires time to collect enough statistics. Offline evaluation, a prevalent approach in the research literature, allows fast experiments but relies on automatic metrics that are not guaranteed to reflect the preferences of real users. In this work, we describe a novel approach we used at JetBrains to address this problem: we leverage an online metric, the number of edits users make to a generated message before committing it to the version control system (VCS), to select metrics for offline experiments. To support this new type of evaluation, we develop a novel markup collection tool that mimics the real workflow with a CMG system, collect a dataset of 57 pairs, each consisting of a commit message generated by GPT-4 and its counterpart edited by a human expert, and design and verify a way to synthetically extend such a dataset. We then use the final dataset of 656 pairs to study how widely used similarity metrics correlate with the online metric, which reflects the experience of real users. Our results indicate that edit distance exhibits the highest correlation with the online metric, whereas commonly used similarity metrics such as BLEU and METEOR demonstrate low correlation. This contradicts previous studies on similarity metrics for CMG and suggests that user interactions with a CMG system in real-world settings differ significantly from the responses of human labelers in controlled environments. We release all the code and the dataset to support future research in the field: https://jb.gg/cmg-evaluation.
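To make the metric-selection idea concrete, the following is a minimal sketch of how an offline similarity metric could be compared against an online signal. It is not the released implementation. It assumes character-level Levenshtein distance as the offline metric and Spearman rank correlation as the agreement measure; both choices, the function names, and the toy data (including the `online_edits` values) are illustrative assumptions rather than details taken from the paper.

```python
from statistics import mean


def edit_distance(a: str, b: str) -> int:
    """Character-level Levenshtein distance (assumed variant of 'edit distance')."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete a character from a
                            curr[j - 1] + 1,      # insert a character into a
                            prev[j - 1] + cost))  # substitute (or match)
        prev = curr
    return prev[-1]


def ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation, computed as Pearson correlation of the ranks."""
    rx, ry = ranks(x), ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical (generated, edited) message pairs; not taken from the dataset.
pairs = [
    ("Fix bug", "Fix bug"),
    ("Update readme file", "Update README"),
    ("Add tests", "Add unit tests for the parser module"),
]
# Hypothetical online signal: number of edits observed for each pair.
online_edits = [0, 2, 5]

offline_scores = [edit_distance(gen, ref) for gen, ref in pairs]
print(spearman(offline_scores, online_edits))
```

Under these assumptions, an offline metric whose scores rank the pairs in the same order as the online edit counts yields a correlation close to 1, and a metric that ranks them differently yields a lower value. The same loop can be repeated with any other similarity metric in place of `edit_distance` to compare candidates.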