Prompt optimization aims to find the best prompt for a large language model (LLM) on a given task. LLMs have been used successfully to find and improve prompt candidates for single-step tasks. However, realistic agent tasks are multi-step and introduce new challenges: (1) prompt content is likely to be more extensive and complex, making it harder for LLMs to analyze errors; (2) the impact of an individual step is difficult to evaluate; and (3) different people may have varied preferences about task execution. While humans struggle to optimize prompts, they are good at providing feedback on LLM outputs; we therefore introduce a new LLM-driven discrete prompt optimization framework that incorporates human-designed feedback rules about potential errors to automatically offer direct suggestions for improvement. Our framework is styled as a genetic algorithm in which an LLM generates new candidate prompts from a parent prompt and its associated feedback; we use a learned heuristic function that predicts prompt performance to sample efficiently from these candidates. This approach significantly outperforms both human-engineered prompts and several other prompt optimization methods across eight representative multi-step tasks (an average 27.7% and 28.2% improvement over the current best methods with GPT-3.5 and GPT-4, respectively). We further show that the task score function can be modified to better align with individual preferences. We believe our work can serve as a benchmark for automatic prompt optimization in LLM-driven multi-step tasks. Datasets and code are available at https://github.com/yongchao98/PROMST; the project page is at https://yongchao98.github.io/MIT-REALM-PROMST.
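The feedback-driven genetic loop described in the abstract can be sketched in outline. This is a minimal, hypothetical sketch, not the authors' implementation: the helpers `evaluate`, `give_feedback`, `propose_children`, and `score_model` are all assumed interfaces standing in for task scoring, human-designed feedback rules, LLM-based prompt generation, and the learned heuristic, respectively.

```python
def optimize_prompt(seed_prompt, evaluate, give_feedback, propose_children,
                    score_model, n_rounds=5, n_children=8, n_keep=2):
    """Feedback-driven genetic search over prompts (illustrative sketch).

    Assumed (hypothetical) interfaces:
      evaluate(prompt)           -> task score (higher is better)
      give_feedback(prompt)      -> error feedback from human-designed rules
      propose_children(p, fb, k) -> k candidate prompts from an LLM
      score_model.predict(p)     -> learned heuristic estimate of performance
    """
    frontier = [(evaluate(seed_prompt), seed_prompt)]
    for _ in range(n_rounds):
        # Select the best-scoring parent prompts from the frontier.
        parents = sorted(frontier, reverse=True)[:n_keep]
        children = []
        for _, parent in parents:
            feedback = give_feedback(parent)
            children.extend(propose_children(parent, feedback, n_children))
        # Rank candidates with the cheap learned heuristic, then run the
        # expensive task evaluation only on the most promising ones.
        children.sort(key=score_model.predict, reverse=True)
        frontier += [(evaluate(c), c) for c in children[:n_keep]]
    return max(frontier)[1]
```

The key efficiency idea is that the learned score model filters candidates before any real multi-step task rollout, so the costly `evaluate` call is reserved for a small top-ranked subset.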