In the short period since the release of ChatGPT in November 2022, large language models (LLMs) have changed the software engineering research landscape. While there are numerous opportunities to use LLMs to support research or software engineering tasks, solid science requires rigorous empirical evaluation. However, so far, there are no specific guidelines for conducting and assessing studies involving LLMs in software engineering research. Our focus is on empirical studies that either use LLMs as part of the research process (e.g., for data annotation) or evaluate existing or new LLM-based tools. This paper contributes the first set of guidelines for such studies. Our goal is to start a discussion in the software engineering research community to reach a common understanding of our community's standards for high-quality empirical studies involving LLMs.