By conditioning on natural language instructions, large language models (LLMs) have displayed impressive capabilities as general-purpose computers. However, task performance depends significantly on the quality of the prompt used to steer the model, and most effective prompts have been handcrafted by humans. Inspired by classical program synthesis and the human approach to prompt engineering, we propose Automatic Prompt Engineer (APE) for automatic instruction generation and selection. In our method, we treat the instruction as the "program," optimized by searching over a pool of instruction candidates proposed by an LLM in order to maximize a chosen score function. To assess the quality of the selected instruction, we measure the zero-shot performance of another LLM following that instruction. Experiments on 24 NLP tasks show that our automatically generated instructions outperform the prior LLM baseline by a large margin and achieve better or comparable performance to the instructions generated by human annotators on 19/24 tasks. We conduct extensive qualitative and quantitative analyses to explore the performance of APE. We show that APE-engineered prompts can be applied to steer models toward truthfulness and/or informativeness, as well as to improve few-shot learning performance by simply prepending them to standard in-context learning prompts. Please check out our webpage at https://sites.google.com/view/automatic-prompt-engineer.
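The selection step described in the abstract (search a pool of LLM-proposed instruction candidates and keep the one that maximizes a chosen score function) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `select_instruction`, `exact_match_score`, `toy_propose`, and `toy_execute` are hypothetical, the two `toy_*` functions are hard-coded stand-ins for real LLM calls, and exact-match accuracy is assumed here only as one possible score function.

```python
from typing import Callable, List, Sequence, Tuple

Example = Tuple[str, str]              # (input, expected output)
Executor = Callable[[str, str], str]   # (instruction, input) -> model output


def exact_match_score(instruction: str, examples: Sequence[Example],
                      execute: Executor) -> float:
    """Assumed score function: fraction of examples answered exactly right."""
    hits = sum(1 for x, y in examples
               if execute(instruction, x).strip() == y.strip())
    return hits / len(examples)


def select_instruction(candidates: Sequence[str], examples: Sequence[Example],
                       execute: Executor,
                       score=exact_match_score) -> Tuple[str, float]:
    """Score every candidate instruction and return the best one with its score."""
    if not candidates:
        raise ValueError("candidate pool is empty")
    scored = [(score(c, examples, execute), c) for c in candidates]
    best_score, best = max(scored, key=lambda pair: pair[0])
    return best, best_score


# --- Hypothetical stand-ins for the two LLM roles (no real model is called) ---

def toy_propose(examples: Sequence[Example], n: int) -> List[str]:
    """Stand-in for the proposing LLM: returns a fixed candidate pool."""
    pool = ["Repeat the word.",
            "Write the antonym of the word.",
            "Translate the word to French."]
    return pool[:n]


def toy_execute(instruction: str, x: str) -> str:
    """Stand-in for the executing LLM: only 'understands' antonym instructions."""
    antonyms = {"hot": "cold", "up": "down", "big": "small"}
    if "antonym" in instruction.lower():
        return antonyms.get(x, x)
    return x  # otherwise just echoes the input


demo_examples = [("hot", "cold"), ("up", "down"), ("big", "small")]
best, best_score = select_instruction(
    toy_propose(demo_examples, 3), demo_examples, toy_execute)
print(best, best_score)
```

In a real setting, `toy_propose` and `toy_execute` would be replaced by calls to the proposing and executing models, and the score function could be swapped for whatever metric suits the task.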