Existing automatic prompt engineering methods are typically designed for discriminative tasks, where new task prompts are iteratively refined using limited feedback from a single metric that reflects a single aspect. However, these approaches are suboptimal for generative tasks, which require more nuanced guidance than a single numeric metric to improve the prompt and to optimize multiple aspects of the generated text. To address these challenges, we propose a novel multi-aspect Critique-Suggestion-guided automatic Prompt Optimization (CriSPO) approach. CriSPO introduces a critique-suggestion module as its core component. This module spontaneously discovers evaluation aspects, compares generated and reference texts across these aspects, and provides specific suggestions for prompt modification. These clear critiques and actionable suggestions guide a receptive optimizer module to make more substantial changes, exploring a broader and more effective search space. To further improve CriSPO with multi-metric optimization, we introduce an Automatic Suffix Tuning (AST) extension that enhances the performance of task prompts across multiple metrics. We evaluate CriSPO on 4 state-of-the-art LLMs across 4 summarization and 5 QA datasets. Extensive experiments show a 3-4% ROUGE score improvement on summarization and substantial improvements across various metrics on QA. Code is available at https://github.com/amazon-science/crispo