Prompt Engineering or Fine Tuning: An Empirical Assessment of Large Language Models in Automated Software Engineering Tasks

In this paper, we investigate the effectiveness of a state-of-the-art LLM, GPT-4, combined with three prompt engineering techniques (basic prompting, in-context learning, and task-specific prompting) against 18 fine-tuned LLMs on three typical automated software engineering (ASE) tasks: code generation, code summarization, and code translation. Our quantitative analysis of these prompting strategies suggests that prompt-engineered GPT-4 does not necessarily and significantly outperform fine-tuned smaller/older LLMs on all three tasks. For comment generation (the code summarization task), GPT-4 with the best prompting strategy (the task-specific prompt) outperformed the first-ranked fine-tuned model by 8.33 percentage points on average in BLEU. However, for code generation, the first-ranked fine-tuned model outperformed GPT-4 with the best prompting strategy by 16.61 and 28.3 percentage points on average in BLEU. For code translation, GPT-4 and the fine-tuned baselines tie, as each outperforms the other on different translation tasks. To explore the impact of different prompting strategies, we conducted a user study with 27 graduate students and 10 industry practitioners. From our qualitative analysis, we find that GPT-4 with conversational prompts (i.e., when a human exchanges feedback and instructions with the model over multiple turns to achieve the best results) showed drastic improvement over GPT-4 with automatic prompting strategies. Moreover, we observe that participants tend to request improvements, add more context, or give specific instructions in their conversational prompts, which goes beyond typical, generic prompting strategies. Our study suggests that, in its current state, GPT-4 with conversational prompting has great potential for ASE tasks, but fully automated prompt engineering with no human in the loop requires more study and improvement.
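Because the results above are reported as differences in BLEU, a minimal sketch of how such a score and gap can be computed may help in reading the numbers. This is a simplified, single-reference, sentence-level BLEU on whitespace tokens with no smoothing. It is not the evaluation script used in the study, which may differ in tokenization, smoothing, and corpus-level aggregation. The function names (`sentence_bleu`, `percentage_point_gap`) and the example strings are hypothetical.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def sentence_bleu(candidate, reference, max_n=4):
    """Simplified single-reference BLEU on whitespace tokens, in [0, 1].

    Assumption: no smoothing, so any n-gram order with zero matches
    yields a score of 0.0.
    """
    cand, ref = candidate.split(), reference.split()
    if not cand:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts, ref_counts = ngrams(cand, n), ngrams(ref, n)
        # Clipped matches: a candidate n-gram is credited at most as many
        # times as it occurs in the reference.
        overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = sum(cand_counts.values())
        if overlap == 0 or total == 0:
            return 0.0
        log_precisions.append(math.log(overlap / total))
    # Brevity penalty: penalize candidates shorter than the reference.
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / len(cand))
    # Geometric mean of the n-gram precisions, scaled by the penalty.
    return bp * math.exp(sum(log_precisions) / max_n)


def percentage_point_gap(score_a, score_b):
    """Gap between two BLEU scores in [0, 1], in percentage points."""
    return (score_a - score_b) * 100


if __name__ == "__main__":
    # Hypothetical generated comment vs. a hypothetical reference comment.
    reference = "returns the sum of two integers"
    generated = "returns the sum of two numbers"
    print(sentence_bleu(generated, reference))
```

A gap of 8.33 percentage points corresponds to, for example, average BLEU scores that differ by 0.0833 on a 0 to 1 scale. It is an absolute difference, not a relative improvement of 8.33%.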
