Large Language Models (LLMs), particularly decoder-only generative models such as GPT, are increasingly used to automate software engineering (SE) tasks. These models are guided primarily through natural language prompts, making prompt engineering a critical factor in system performance and behavior. Despite the growing role of LLMs in SE research, prompt-related decisions are rarely documented in a systematic or transparent manner, hindering reproducibility and comparability across studies. To address this gap, we conducted a two-phase empirical study. First, we analyzed nearly 300 papers published at the top three SE conferences since 2022 to assess how prompt design, testing, and optimization are currently reported. Second, we surveyed 105 program committee members from these conferences to capture their expectations for prompt reporting in LLM-based research. Based on these findings, we derived a structured guideline that distinguishes essential, desirable, and exceptional reporting elements. Our results reveal substantial misalignment between current practice and reviewer expectations, particularly regarding version disclosure, prompt justification, and threats to validity. We present our guideline as a step toward greater transparency, reproducibility, and methodological rigor in LLM-based SE research.