Prompt optimization aims to find the best prompt for a large language model (LLM) on a given task. LLMs have been successfully used to help find and improve prompt candidates for single-step tasks. However, realistic tasks for agents are multi-step and introduce new challenges: (1) prompt content is likely to be more extensive and complex, making it more difficult for LLMs to analyze errors, (2) the impact of an individual step is difficult to evaluate, and (3) different people may have varied preferences about task execution. While humans struggle to optimize prompts, they are good at providing feedback about LLM outputs; we therefore introduce a new LLM-driven discrete prompt optimization framework, PRompt Optimization in Multi-Step Tasks (PROMST), which incorporates human-designed feedback rules to automatically offer direct suggestions for improvement. We also use an extra learned heuristic model that predicts prompt performance to efficiently sample from prompt candidates. This approach significantly outperforms both human-engineered prompts and several other prompt optimization methods across 11 representative multi-step tasks (an average improvement of 10.6\%--29.3\% over the current best methods on five LLMs, respectively). We believe our work can serve as a benchmark for automatic prompt optimization in LLM-driven multi-step tasks. Datasets and code are available at https://github.com/yongchao98/PROMST. The project page is available at https://yongchao98.github.io/MIT-REALM-PROMST.