We introduce a new framework, Directional Stimulus Prompting, that uses a tunable language model (LM) to guide a frozen, black-box large language model (LLM) on downstream tasks. Unlike prior work that manually or automatically searches for the optimal prompt for each task, we train a policy LM to generate discrete tokens as a directional stimulus for each input: a hint or cue, such as the keywords of an article to be summarized. The directional stimulus is then combined with the original input and fed into the LLM to guide its generation toward the desired target. The policy LM can be trained through 1) supervised learning from annotated data and 2) reinforcement learning from offline and online rewards, which lets it explore directional stimuli that better align LLMs with human preferences. The framework is flexibly applicable to various LMs and tasks. To verify its effectiveness, we apply it to summarization and dialogue response generation. Experimental results demonstrate that it can significantly improve LLMs' performance with a small collection of training data: a T5 (780M) trained with 2,000 samples from the CNN/Daily Mail dataset improves the performance of Codex (175B) by 9.0% in ROUGE-Avg scores, and only 80 dialogues boost the combined score on the MultiWOZ dataset by 39.7%, achieving performance comparable to or even better than some fully trained models. We have made our code publicly available.
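To make the data flow described above concrete, the following is a minimal sketch of the inference-time pipeline for summarization: a policy produces keyword hints for an input, the hints are combined with the input into a prompt, and the prompt is passed to a frozen LLM. All names here (`toy_policy_hints`, `build_prompt`, `summarize`) and the prompt wording are illustrative assumptions, not the released code. In particular, the policy is a simple word-frequency heuristic standing in for the trained T5 policy LM, and the LLM is any user-supplied callable.

```python
import re
from collections import Counter
from typing import Callable, List

# Assumption: a tiny stopword list, only for this illustration.
STOPWORDS = {"the", "a", "an", "of", "to", "and", "in", "on", "for", "is", "was", "that", "with"}


def toy_policy_hints(article: str, k: int = 3) -> List[str]:
    """Stand-in for the policy LM (hypothetical).

    Returns the k most frequent non-stopword tokens as keyword hints.
    The framework itself would use a trained policy LM here instead.
    """
    words = [w for w in re.findall(r"[a-z]+", article.lower()) if w not in STOPWORDS]
    return [word for word, _ in Counter(words).most_common(k)]


def build_prompt(article: str, hints: List[str]) -> str:
    """Combine the original input with the directional stimulus.

    The prompt template is an assumption made for this sketch.
    """
    return (
        f"Article: {article}\n"
        f"Keywords: {'; '.join(hints)}\n"
        "Write a short summary of the article that covers the keywords.\n"
        "Summary:"
    )


def summarize(
    article: str,
    llm: Callable[[str], str],
    policy: Callable[[str], List[str]] = toy_policy_hints,
) -> str:
    """Generate hints with the policy, then query the frozen black-box LLM."""
    hints = policy(article)
    return llm(build_prompt(article, hints))
```

Because the LLM is only ever called through `llm(prompt)`, it can remain a black box; only the policy would be updated during supervised or reinforcement learning, which this sketch does not cover.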