Large language models excel at a variety of language tasks when prompted with examples or instructions. Yet controlling these models through prompting alone is limited. Tailoring language models through fine-tuning (e.g., via reinforcement learning) can be effective, but it is expensive and requires model access. We propose Inference-time Policy Adapters (IPA), which efficiently tailors a language model such as GPT-3 without fine-tuning it. IPA guides a large base model at decoding time through a lightweight policy adapter trained to optimize an arbitrary user objective with reinforcement learning. On five challenging text generation tasks, such as toxicity reduction and open-domain generation, IPA consistently brings significant improvements over off-the-shelf language models. It outperforms competitive baseline methods, sometimes even including expensive fine-tuning. In particular, tailoring GPT-2 with IPA can outperform GPT-3, while tailoring GPT-3 with IPA brings a major performance boost over GPT-3 (and sometimes even over GPT-4). Our promising results highlight the potential of IPA as a lightweight alternative for tailoring extreme-scale language models.
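To make the decoding-time guidance concrete, the following is a minimal sketch of one way a frozen base model and a small adapter could be combined at each decoding step. The abstract does not specify the combination rule; the sketch assumes a product-of-experts formulation, in which the two next-token distributions are multiplied and renormalized. The function names (`combine_next_token`, `greedy_step`) and the toy three-token vocabulary are hypothetical, and the reinforcement learning step that would train the adapter is not shown.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def combine_next_token(base_logits, adapter_logits):
    """Return next-token probabilities proportional to p_base * p_adapter.

    Assumption for illustration: the base model and the adapter share a
    vocabulary and are combined as a product of experts. The base model is
    only queried for its logits; it is never updated.
    """
    if len(base_logits) != len(adapter_logits):
        raise ValueError("base and adapter must share the same vocabulary size")
    base_lp = log_softmax(base_logits)
    adapter_lp = log_softmax(adapter_logits)
    # Adding log-probabilities multiplies the distributions; renormalize after.
    combined_lp = log_softmax([b + a for b, a in zip(base_lp, adapter_lp)])
    return [math.exp(lp) for lp in combined_lp]


def greedy_step(base_logits, adapter_logits):
    """Pick the most likely token id under the combined distribution."""
    probs = combine_next_token(base_logits, adapter_logits)
    return max(range(len(probs)), key=probs.__getitem__)


# Toy example: the base model prefers token 0, the adapter penalizes it.
base_logits = [2.0, 1.0, 0.0]
adapter_logits = [-3.0, 0.5, 0.0]
next_token = greedy_step(base_logits, adapter_logits)
```

In this toy case the base model alone would pick token 0, while the combined distribution picks token 1. An adapter that outputs uniform logits leaves the base distribution unchanged, which is one reason a multiplicative combination is a natural choice for this kind of sketch.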