Automated Commit Message Generation with Large Language Models: An Empirical Study and Beyond

Commit Message Generation (CMG) approaches aim to automatically generate commit messages from given code diffs. Such messages facilitate collaboration among developers and play a critical role in Open-Source Software (OSS). Very recently, Large Language Models (LLMs) have demonstrated extensive applicability in diverse code-related tasks, but few studies have systematically explored their effectiveness for CMG. This paper conducts the first comprehensive experiment to investigate how far we have come in applying LLMs to generate high-quality commit messages. Motivated by a pilot analysis, we first clean the most widely used CMG dataset following practitioners' criteria. Afterward, we re-evaluate diverse state-of-the-art CMG approaches and compare them with LLMs, demonstrating the superior performance of LLMs over these approaches. We then propose four manual metrics that follow OSS practice, namely Accuracy, Integrity, Applicability, and Readability, and assess various LLMs accordingly. Results reveal that GPT-3.5 performs best overall, but different LLMs have different advantages. To further boost LLMs' performance on the CMG task, we propose an Efficient Retrieval-based In-Context Learning (ICL) framework, named ERICommiter, which uses two-step filtering to accelerate retrieval and introduces semantic/lexical-based retrieval algorithms to construct the ICL examples. Extensive experiments demonstrate that ERICommiter substantially improves the performance of various LLMs on code diffs in different programming languages. Meanwhile, the two-step filtering significantly reduces retrieval time while keeping almost the same performance. Our research contributes to the understanding of LLMs' capabilities in the CMG field and provides valuable insights for practitioners seeking to leverage these tools in their workflows.
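To make the idea of retrieval-based ICL with two-step filtering concrete, the following is a minimal sketch and not ERICommiter's actual implementation. The abstract does not specify the filters or the similarity measures, so everything here is an assumption made for illustration: the coarse filter matches file extensions, the cheap shortlist counts shared tokens, the final lexical ranking uses Jaccard similarity, and the names `Example`, `retrieve`, and `build_prompt` are hypothetical.

```python
import re
from dataclasses import dataclass
from typing import List, Set


@dataclass
class Example:
    """A historical (diff, commit message) pair from the retrieval corpus."""
    diff: str
    message: str


def tokenize(text: str) -> Set[str]:
    # Lowercased identifier-like tokens; a stand-in for a real lexical tokenizer.
    return set(re.findall(r"[A-Za-z_]\w*", text.lower()))


def file_extensions(diff: str) -> Set[str]:
    # Extensions of files named in '+++ b/<path>' diff headers.
    return set(re.findall(r"^\+\+\+ b/\S*?(\.\w+)$", diff, flags=re.MULTILINE))


def jaccard(a: Set[str], b: Set[str]) -> float:
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def retrieve(query_diff: str, corpus: List[Example],
             k: int = 2, shortlist: int = 50) -> List[Example]:
    # Step 1 (assumed coarse filter): keep examples touching the same file types.
    # Fall back to the whole corpus if nothing matches.
    exts = file_extensions(query_diff)
    pool = [ex for ex in corpus if file_extensions(ex.diff) & exts] or list(corpus)

    # Step 2 (assumed cheap filter): shortlist by the raw count of shared tokens.
    q = tokenize(query_diff)
    pool = sorted(pool, key=lambda ex: len(q & tokenize(ex.diff)),
                  reverse=True)[:shortlist]

    # Final ranking (assumed lexical retrieval): Jaccard similarity on tokens.
    # A semantic variant would rank by embedding similarity instead.
    ranked = sorted(pool, key=lambda ex: jaccard(q, tokenize(ex.diff)),
                    reverse=True)
    return ranked[:k]


def build_prompt(query_diff: str, examples: List[Example]) -> str:
    # Retrieved pairs become in-context demonstrations ahead of the query diff.
    parts = [f"Diff:\n{ex.diff}\nCommit message: {ex.message}\n" for ex in examples]
    parts.append(f"Diff:\n{query_diff}\nCommit message:")
    return "\n".join(parts)
```

The filtering steps only shrink the candidate pool before the more expensive ranking runs, which is the general reason such a design can cut retrieval time with little effect on which examples are finally selected.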
