Prompt engineering is both challenging and important, because Large Language Models (LLMs) are highly sensitive to the given prompt and a textual task instruction is inherently ambiguous. Automatic prompt engineering is essential for achieving optimized performance from LLMs. Recent studies have shown that LLMs can conduct prompt engineering automatically by using a meta-prompt that incorporates the outcomes of previous trials and proposes an improved prompt. However, this requires a high-quality benchmark for comparing different prompts, which is difficult and expensive to acquire in many real-world use cases. In this work, we introduce a new method for automatic prompt engineering that uses a calibration process to iteratively refine the prompt toward the user's intent. During the optimization process, the system jointly generates synthetic data of boundary use cases and optimizes the prompt according to the generated dataset. We demonstrate the effectiveness of our method with respect to strong proprietary models on real-world tasks such as moderation and generation. Our method outperforms state-of-the-art methods with a limited number of annotated samples. Furthermore, we validate the advantages of each of the system's key components. Our system is built in a modular way, facilitating easy adaptation to other tasks. The code is available at https://github.com/Eladlev/AutoPrompt.
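The calibration loop described above (generate boundary cases, evaluate the current prompt on them, then ask a meta-prompt for an improved prompt) can be illustrated with a minimal sketch. This is not the AutoPrompt implementation: every name here (`Sample`, `evaluate`, `calibrate`, and the `toy_*` helpers) is hypothetical, and the three LLM-backed components are replaced by deterministic stand-ins so that the control flow can be run without any model access.

```python
from dataclasses import dataclass
from itertools import islice
from typing import Callable, Iterable, List, Tuple


@dataclass(frozen=True)
class Sample:
    text: str
    label: bool  # annotation from a human or a trusted annotator (assumed given)


# One record per trial: (prompt, accuracy, misclassified samples).
Trial = Tuple[str, float, List[Sample]]


def evaluate(prompt: str, dataset: List[Sample],
             predict: Callable[[str, str], bool]) -> Tuple[float, List[Sample]]:
    """Score a prompt on the dataset; return accuracy and the misclassified samples."""
    errors = [s for s in dataset if predict(prompt, s.text) != s.label]
    accuracy = 1.0 - len(errors) / len(dataset) if dataset else 0.0
    return accuracy, errors


def calibrate(initial_prompt: str,
              generate: Callable[[str, List[Sample]], List[Sample]],
              predict: Callable[[str, str], bool],
              propose: Callable[[List[Trial]], str],
              max_iters: int = 5) -> Tuple[str, float]:
    """Jointly grow a synthetic dataset and refine the prompt against it."""
    prompt, dataset, history, errors = initial_prompt, [], [], []
    for _ in range(max_iters):
        # 1. Add new annotated boundary cases (an LLM call in a real system).
        dataset.extend(generate(prompt, errors))
        # 2. Evaluate the current prompt on everything collected so far.
        accuracy, errors = evaluate(prompt, dataset, predict)
        history.append((prompt, accuracy, errors))
        if not errors:
            break
        # 3. Meta-prompt step: propose a new prompt from the trial history.
        prompt = propose(history)
    # Re-score every candidate on the final dataset so the comparison is fair.
    scored = [(evaluate(p, dataset, predict)[0], p) for p, _, _ in history]
    best_accuracy, best_prompt = max(scored, key=lambda pair: pair[0])
    return best_prompt, best_accuracy


# --- Toy stand-ins (hypothetical): a "prompt" is just a keyword list. ---

def _keywords(prompt: str) -> List[str]:
    return [k.strip() for k in prompt.split(":", 1)[1].split(",") if k.strip()]


def toy_predict(prompt: str, text: str) -> bool:
    """Flag the text if it contains any keyword listed in the prompt."""
    return any(k in text.lower() for k in _keywords(prompt))


def toy_propose(history: List[Trial]) -> str:
    """Add the first word of each missed positive sample to the keyword list."""
    prompt, _, errors = history[-1]
    kws = _keywords(prompt)
    for s in errors:
        first = s.text.split()[0].lower()
        if s.label and first not in kws:
            kws.append(first)
    return "flag: " + ",".join(kws)


def make_toy_generator(pool: Iterable[Sample], batch: int = 2):
    """Hand out pre-annotated samples in batches, ignoring the prompt and errors."""
    it = iter(pool)
    return lambda prompt, errors: list(islice(it, batch))
```

In a real system, `generate`, `predict`, and `propose` would each wrap an LLM call, and the annotations would come from the user. Re-scoring all candidate prompts on the final dataset is a design choice of this sketch: accuracies recorded during the loop are measured on datasets of different sizes and are not directly comparable.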