In this work, we address the problem of steering the text generation of a language model (LM) towards a desired behavior, so that the generated text aligns with the preferences of the human operator. We propose using a second, instruction-tuned language model as a zero-shot critic reward model: it is prompted with a Yes-No question that encodes the user's preferences, and no further labeled data is required. This zero-shot reward model provides the learning signal used to further fine-tune the base LM with Reinforcement Learning from AI Feedback (RLAIF); the approach is also compatible with other settings, such as quality-diversity search. We provide extensive evidence of the capabilities of the proposed ZYN framework through experiments in several text-generation domains: detoxification; optimizing the sentiment of movie reviews, or any other attribute; steering the opinion the model may express about a particular topic; and personalizing prompt generators for text-to-image tasks. Code is available at \url{https://github.com/vicgalle/zero-shot-reward-models/}.
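
To make the Yes-No critic idea concrete, the following is a minimal sketch, not the implementation from the linked repository. It assumes the reward is the probability the critic assigns to "Yes" after renormalizing over the two answer tokens; the abstract does not specify the exact formula. The names `build_prompt`, `yes_no_reward`, `toy_critic`, and `score` are hypothetical, and `toy_critic` is a keyword-based stand-in for a real instruction-tuned LM that would return next-token logits.

```python
import math
from typing import Callable, Dict


def build_prompt(question: str, text: str) -> str:
    """Wrap the generated text and the preference question into a critic prompt."""
    return f"Text: {text}\nQuestion: {question}\nAnswer (Yes or No):"


def yes_no_reward(logits: Dict[str, float]) -> float:
    """Probability of 'Yes' after renormalizing over the tokens {'Yes', 'No'}.

    Assumption: a softmax restricted to two tokens, which equals a sigmoid
    of the logit gap (yes - no).
    """
    yes, no = logits["Yes"], logits["No"]
    return 1.0 / (1.0 + math.exp(no - yes))


def toy_critic(prompt: str) -> Dict[str, float]:
    """Hypothetical stand-in for an instruction-tuned LM critic.

    A real critic would return the next-token logits of 'Yes' and 'No'
    given the prompt; here a keyword check on the text line fakes them.
    """
    text_line = prompt.splitlines()[0].lower()
    hit = any(word in text_line for word in ("great", "wonderful"))
    return {"Yes": 2.0, "No": -2.0} if hit else {"Yes": -2.0, "No": 2.0}


def score(
    text: str,
    question: str,
    critic: Callable[[str], Dict[str, float]] = toy_critic,
) -> float:
    """Zero-shot reward in (0, 1) for one generated text."""
    return yes_no_reward(critic(build_prompt(question, text)))


if __name__ == "__main__":
    q = "Is this movie review positive?"
    print(score("A great movie with a wonderful cast.", q))
    print(score("Two hours I will never get back.", q))
```

In an RLAIF loop, a scalar such as `score(...)` would play the role of the per-sample reward passed to the policy-gradient update; swapping the question changes the target behavior without collecting new labels.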