Commit Message Generation (CMG) approaches aim to automatically generate commit messages from given code diffs, facilitating collaboration among developers and playing a critical role in Open-Source Software (OSS) development. Recently, Large Language Models (LLMs) have demonstrated broad applicability across diverse code-related tasks, yet few studies have systematically explored their effectiveness for CMG. This paper presents the first comprehensive experiment investigating how far we have come in applying LLMs to generate high-quality commit messages. Motivated by a pilot analysis, we first clean the most widely used CMG dataset following practitioners' criteria. We then re-evaluate diverse state-of-the-art CMG approaches and compare them with LLMs, demonstrating the superior performance of LLMs over state-of-the-art CMG approaches. Next, following OSS practice, we propose four manual metrics, namely Accuracy, Integrity, Applicability, and Readability, and assess various LLMs accordingly. Results reveal that GPT-3.5 performs best overall, though different LLMs offer different advantages. To further boost LLMs' performance on the CMG task, we propose an Efficient Retrieval-based In-Context Learning (ICL) framework, namely ERICommiter, which leverages two-step filtering to accelerate retrieval and introduces semantic- and lexical-based retrieval algorithms to construct ICL examples. Extensive experiments demonstrate that ERICommiter substantially improves the performance of various LLMs on code diffs across different programming languages, while significantly reducing retrieval time at almost no cost in performance. Our research contributes to the understanding of LLMs' capabilities in the CMG field and provides valuable insights for practitioners seeking to leverage these tools in their workflows.
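To make the retrieval-based ICL idea concrete, the sketch below builds a few-shot prompt for CMG by ranking a pool of (diff, message) pairs against a query diff. It uses token-set Jaccard similarity as a stand-in lexical retriever; the function names, the similarity measure, and the prompt template are illustrative assumptions, not ERICommiter's actual two-step filtering or semantic/lexical algorithms.

```python
# Hypothetical sketch of retrieval-based in-context learning (ICL) for commit
# message generation. Jaccard token overlap is an assumed stand-in retriever,
# not the algorithm used by ERICommiter.

def tokenize(diff: str) -> set:
    """Split a diff into a set of whitespace-delimited tokens."""
    return set(diff.split())

def jaccard(a: set, b: set) -> float:
    """Jaccard similarity between two token sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def retrieve_examples(query_diff: str, pool: list, k: int = 2) -> list:
    """Return the k (diff, message) pairs most lexically similar to the query."""
    q = tokenize(query_diff)
    ranked = sorted(pool, key=lambda ex: jaccard(q, tokenize(ex[0])), reverse=True)
    return ranked[:k]

def build_prompt(query_diff: str, examples: list) -> str:
    """Assemble a few-shot ICL prompt from retrieved (diff, message) examples."""
    parts = ["Generate a concise commit message for the final diff."]
    for diff, msg in examples:
        parts.append(f"Diff:\n{diff}\nCommit message: {msg}")
    parts.append(f"Diff:\n{query_diff}\nCommit message:")
    return "\n\n".join(parts)
```

In a real pipeline the prompt would then be sent to an LLM; a cheap pre-filter over the example pool (as in the two-step filtering described above) keeps this ranking step fast when the pool is large.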