We introduce Directional Stimulus Prompting, a novel prompting framework for guiding black-box large language models (LLMs) toward desired outputs. The framework adds a new component to the prompt, the directional stimulus, which provides more fine-grained guidance and control over LLMs. The directional stimulus acts as a hint or cue for each input query that steers the LLM toward the desired output; in summarization, for example, it can be a set of keywords that the desired summary should include. We use a small tunable model (e.g., T5) to generate the directional stimulus for each query, which lets us optimize a black-box LLM indirectly by optimizing a small policy model. This policy model can be trained through 1) supervised fine-tuning on labeled data and 2) reinforcement learning from offline or online rewards, which explores directional stimuli that better align the LLM with desired behaviors. We evaluate our framework on summarization and dialogue response generation tasks. Experimental results show that, using only a small amount of training data, our framework consistently improves ChatGPT's performance over standard prompting, and that reinforcement learning improves performance further. Notably, on the MultiWOZ dataset, our framework enables ChatGPT to achieve a 41.4% improvement in its combined score with only 80 dialogues, matching or even surpassing the performance of some fully trained state-of-the-art models. We have made our code publicly available.
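To make the prompt structure concrete, the following is a minimal sketch of how a directional stimulus could be placed into a summarization prompt. It is an illustration under stated assumptions, not the released implementation: the prompt wording, the function names (`build_standard_prompt`, `build_stimulus_prompt`, `toy_policy`, `prompt_with_policy`), and the keyword heuristic are all hypothetical. In particular, `toy_policy` is a simple stand-in for the small tuned policy model (e.g., T5) and is not trained in any way.

```python
from typing import Callable, List


def build_standard_prompt(article: str) -> str:
    """Baseline prompt: the query alone, with no hint (assumed wording)."""
    return f"Article: {article}\nQ: Summarize the above article briefly.\nA:"


def build_stimulus_prompt(article: str, keywords: List[str]) -> str:
    """Same query plus a directional stimulus, here a list of keyword hints."""
    hint = "; ".join(keywords)
    return (
        f"Article: {article}\n"
        "Q: Summarize the above article briefly, based on the hint.\n"
        f"Hint: {hint}\n"
        "A:"
    )


def toy_policy(article: str, k: int = 3) -> List[str]:
    """Hypothetical stand-in for the policy model: the k longest distinct words.

    A real policy model would be a small language model trained with
    supervised fine-tuning and then reinforcement learning.
    """
    words: List[str] = []
    for raw in article.split():
        word = raw.strip(".,;:!?").lower()
        if word and word not in words:
            words.append(word)
    # sorted() is stable, so equally long words keep their order of appearance.
    return sorted(words, key=len, reverse=True)[:k]


def prompt_with_policy(article: str, policy: Callable[[str], List[str]]) -> str:
    """Generate the stimulus for this query, then build the final LLM prompt."""
    return build_stimulus_prompt(article, policy(article))


if __name__ == "__main__":
    text = "Storm closes airport. Flights resume tomorrow."
    print(prompt_with_policy(text, toy_policy))
```

Only the policy passed to `prompt_with_policy` would be trained; the black-box LLM that receives the resulting prompt stays untouched, which is the design choice the framework relies on.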