We introduce Directional Stimulus Prompting, a framework that uses a tunable language model (LM) to guide a frozen, black-box large language model (LLM) on downstream tasks. Unlike prior work that manually or automatically searches for an optimal prompt for each task, we train a policy LM to generate discrete tokens as a "directional stimulus" for each input: a hint or cue, such as keywords of an article to be summarized. The directional stimulus is then combined with the original input and fed into the LLM to guide its generation toward the desired target. The policy LM can be trained through 1) supervised learning from annotated data and 2) reinforcement learning from offline and online rewards, which explores directional stimuli that better align the LLM with human preferences. The framework applies flexibly to various LMs and tasks. To verify its effectiveness, we apply it to summarization and dialogue response generation. Experimental results show that it can significantly improve LLM performance with a small amount of training data: a T5 (780M) trained with 2,000 samples from the CNN/Daily Mail dataset improves Codex (175B)'s ROUGE-Avg score by 7.2%, and training on 500 dialogues boosts the combined score on the MultiWOZ dataset by 52.5%, achieving performance comparable to or even better than that of fully trained models.
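To make the data flow concrete, the following is a minimal sketch of how a directional stimulus might be combined with the original input before querying the LLM. It is an illustration under stated assumptions, not the implementation used in this work: the prompt wording, the function names (`build_prompt`, `summarize_with_stimulus`), and the stand-ins `toy_policy_lm` and `echo_llm` are all hypothetical. In particular, `toy_policy_lm` is a simple heuristic that takes the place of the trained policy LM, and `echo_llm` takes the place of a real black-box LLM call.

```python
from typing import Callable, List


def build_prompt(article: str, hints: List[str]) -> str:
    """Combine the original input with the directional stimulus (here, keywords).

    The template wording is an assumption made for illustration.
    """
    hint_line = "; ".join(hints)
    return (
        f"Article: {article}\n"
        f"Keywords: {hint_line}\n"
        "Summarize the article in 2-3 sentences, using the keywords as hints.\n"
        "Summary:"
    )


def summarize_with_stimulus(
    article: str,
    policy_lm: Callable[[str], List[str]],
    frozen_llm: Callable[[str], str],
) -> str:
    """The policy LM proposes hints; the frozen LLM is only queried, never updated."""
    hints = policy_lm(article)
    return frozen_llm(build_prompt(article, hints))


# Hypothetical stand-ins so the sketch runs without any model or API access.
def toy_policy_lm(article: str) -> List[str]:
    # Placeholder heuristic: keep up to five capitalized words, in order of
    # first appearance. A trained policy LM would generate these tokens instead.
    seen: List[str] = []
    for word in article.split():
        token = word.strip(".,;:!?\"'")
        if token[:1].isupper() and token not in seen:
            seen.append(token)
    return seen[:5]


def echo_llm(prompt: str) -> str:
    # Placeholder for a black-box LLM call; returns the prompt for inspection.
    return prompt


if __name__ == "__main__":
    text = "Alice met Bob in Paris. Alice liked Paris."
    print(summarize_with_stimulus(text, toy_policy_lm, echo_llm))
```

Passing the policy and the LLM as plain callables mirrors the separation described above: only the policy side would be trained, while the LLM is treated as a fixed function from prompt to text.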