Exploring the Effectiveness of LLMs in Automated Logging Generation: An Empirical Study

Automated logging statement generation techniques help developers write appropriate logging statements that document software behavior. Current retrieval-based and learning-based logging methods fail to provide accurate logging statements in complex software. Although existing large language models (LLMs) might be a good fit for the task, given their success in natural language generation and programming language comprehension, their effectiveness and generalization capabilities have not been explored. To this end, this paper performs the first extensive study on applying LLMs to logging statement generation. We build LogBench, the first logging statement generation dataset. On LogBench, we evaluate the effectiveness and generalization capabilities of eight state-of-the-art LLMs, which include general-purpose and code-specific models ranging from 60M to 175B parameters. Specifically, we evaluate the LLMs' logging effectiveness by studying 1) their ability to decide logging ingredients, 2) the impact of the internal characteristics of LLMs, and 3) the influence of external factors. We further evaluate the LLMs' logging generalization capabilities using unseen data derived from code transformation techniques. Our study demonstrates that existing LLMs fall short of practical requirements for generating proper logging statement texts. We also disclose the impact of internal characteristics and external factors on LLMs in automated logging. In addition, we observe that existing LLMs cannot generalize to unseen code, revealing their unsatisfactory generalization capabilities. Based on our findings, we further discuss three implications that can enhance logging statement generation in the future, including developing a unified metric for logging quality, incorporating shareable code knowledge into LLMs, and devising suitable prompts.
