Existing methods for controlling language models, such as RLHF and Constitutional AI, involve determining which LLM behaviors are desirable and training them into a language model. However, in many cases, it is desirable for LLMs to be controllable \textit{at inference time}, so that they can be used in multiple contexts with diverse needs. We illustrate this with the \textbf{Pink Elephant Problem}: instructing an LLM to avoid discussing a certain entity (a ``Pink Elephant''), and instead discuss a preferred entity (a ``Grey Elephant''). We apply a novel simplification of Constitutional AI, \textbf{Direct Principle Feedback} (DPF), which skips the ranking of responses and uses DPO directly on critiques and revisions. Our results show that after DPF fine-tuning on our synthetic Pink Elephants dataset, our 13B fine-tuned LLaMA 2 model significantly outperforms Llama-2-13B-Chat and a prompted baseline, and performs as well as GPT-4 on our curated test set assessing the Pink Elephant Problem.
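
One way to read ``uses DPO directly on critiques and revisions'' is sketched below, assuming the standard DPO objective. The notation is illustrative and not taken from the text above: $x$ denotes a prompt containing the avoidance instruction, $y_0$ the original response, and $y_r$ its revision, with the revision treated as the preferred response and the original as the dispreferred one, so that no separate ranking step is needed:
\begin{equation}
\mathcal{L}(\theta) = -\,\mathbb{E}_{(x,\, y_0,\, y_r)}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_r \mid x)}{\pi_{\mathrm{ref}}(y_r \mid x)} - \beta \log \frac{\pi_\theta(y_0 \mid x)}{\pi_{\mathrm{ref}}(y_0 \mid x)}\right)\right],
\end{equation}
where $\pi_\theta$ is the model being fine-tuned, $\pi_{\mathrm{ref}}$ is the frozen reference model, $\sigma$ is the logistic function, and $\beta$ is the DPO temperature.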