Large Language Models (LLMs) have significantly advanced software engineering (SE) tasks, with prompt engineering techniques enhancing their performance in code-related areas. However, the rapid development of foundational LLMs such as the non-reasoning model GPT-4o and the reasoning model o1 raises questions about the continued effectiveness of these prompt engineering techniques. This paper presents an extensive empirical study that reevaluates various prompt engineering techniques within the context of these advanced LLMs. Focusing on three representative SE tasks, i.e., code generation, code translation, and code summarization, we assess whether prompt engineering techniques still yield improvements with advanced models, the actual effectiveness of reasoning models compared to non-reasoning models, and whether the benefits of using these advanced models justify their increased costs. Our findings reveal that prompt engineering techniques developed for earlier LLMs may provide diminished benefits or even hinder performance when applied to advanced models. In reasoning LLMs, sophisticated built-in reasoning capabilities reduce the impact of complex prompts, sometimes making simple zero-shot prompting more effective. Furthermore, while reasoning models outperform non-reasoning models in tasks requiring complex reasoning, they offer minimal advantages in tasks that do not need reasoning and may incur unnecessary costs. Based on our study, we provide practical guidance for practitioners on selecting appropriate prompt engineering techniques and foundational LLMs, considering factors such as task requirements, operational costs, and environmental impact. Our work contributes to a deeper understanding of how to effectively harness advanced LLMs in SE tasks, informing future research and application development.