Large Language Models (LLMs) have significantly advanced software engineering (SE) tasks, with prompt engineering techniques enhancing their performance in code-related areas. However, the rapid development of foundational LLMs such as the non-reasoning model GPT-4o and the reasoning model o1 raises questions about the continued effectiveness of these prompt engineering techniques. This paper presents an extensive empirical study that reevaluates various prompt engineering techniques within the context of these advanced LLMs. Focusing on three representative SE tasks, i.e., code generation, code translation, and code summarization, we assess whether prompt engineering techniques still yield improvements with advanced models, the actual effectiveness of reasoning models compared to non-reasoning models, and whether the benefits of using these advanced models justify their increased costs. Our findings reveal that prompt engineering techniques developed for earlier LLMs may provide diminished benefits or even hinder performance when applied to advanced models. In reasoning LLMs, sophisticated built-in reasoning capabilities reduce the impact of complex prompts, sometimes making simple zero-shot prompting more effective. Furthermore, while reasoning models outperform non-reasoning models in tasks requiring complex reasoning, they offer minimal advantages in tasks that do not need reasoning and may incur unnecessary costs. Based on our study, we provide practical guidance for practitioners on selecting appropriate prompt engineering techniques and foundational LLMs, considering factors such as task requirements, operational costs, and environmental impact. Our work contributes to a deeper understanding of how to effectively harness advanced LLMs in SE tasks, informing future research and application development.