Automated logging statement generation techniques help developers write appropriate logging statements that document software behaviors. Current retrieval-based and learning-based logging methods fail to provide accurate logging statements in complex software. Although existing large language models (LLMs) might be a good fit for the task, given their success in natural language generation and programming language comprehension, their effectiveness and generalization capabilities have not been explored. To this end, this paper performs the first extensive study on applying LLMs to logging statement generation. We build LogBench, the first logging statement generation dataset. On LogBench, we evaluate the effectiveness and generalization capabilities of eight state-of-the-art LLMs, which include general-purpose and code-specific models ranging from 60M to 175B parameters in size. Specifically, we evaluate LLMs' logging effectiveness by studying 1) their ability to decide logging ingredients, 2) the impact of the internal characteristics of LLMs, and 3) the influence of external factors. We further evaluate LLMs' logging generalization capabilities using unseen data derived from code transformation techniques. Our study demonstrates that existing LLMs fall short of practical requirements for generating proper logging statement texts. We also disclose the impact of internal characteristics and external factors on LLMs in automated logging. In addition, we observe that existing LLMs cannot generalize to unseen code when generating logging statements, revealing their unsatisfactory generalization capabilities. Based on our findings, we further discuss three implications that can enhance logging statement generation in the future, including developing a unified metric for logging quality, incorporating shareable code knowledge into LLMs, and devising suitable prompts.
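To illustrate what "unseen data derived from code transformation techniques" could look like, the following is a minimal sketch of one semantics-preserving transformation, variable renaming. The abstract does not say which transformations are used, and LogBench itself is not shown here, so everything below is an assumption for illustration: the helper names `RenameVariable` and `rename_variable`, the sample function `read_config`, and the choice of Python are all hypothetical.

```python
import ast


class RenameVariable(ast.NodeTransformer):
    """Rename every occurrence of one identifier (hypothetical helper)."""

    def __init__(self, old: str, new: str) -> None:
        self.old = old
        self.new = new

    def visit_Name(self, node: ast.Name) -> ast.Name:
        # Covers both reads and writes of the variable.
        if node.id == self.old:
            node.id = self.new
        return node

    def visit_arg(self, node: ast.arg) -> ast.arg:
        # Covers function parameters with the same name.
        if node.arg == self.old:
            node.arg = self.new
        return node


def rename_variable(source: str, old: str, new: str) -> str:
    """Return `source` with identifier `old` renamed to `new`."""
    tree = RenameVariable(old, new).visit(ast.parse(source))
    return ast.unparse(tree)  # ast.unparse requires Python 3.9+


# Illustrative snippet only; it is parsed, never executed.
ORIGINAL = '''
def read_config(path):
    data = open_file(path)
    if data is None:
        logger.error("Failed to read %s", path)
    return data
'''

# Same behavior and same logging statement, different surface form.
TRANSFORMED = rename_variable(ORIGINAL, "data", "payload")
```

The transformed snippet keeps the logging statement and control flow intact while changing the tokens a model sees, which is the kind of variation a generalization test might rely on. The sketch does not check that the new name is unused elsewhere in the code, which a real transformation tool would need to do.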