Large language models (LLMs) often behave inconsistently with human preferences. Previous research has typically gathered human preference data and then aligned pre-trained models through reinforcement learning or instruction tuning, i.e., a fine-tuning step. In contrast, aligning frozen LLMs without any alignment data is more appealing. This work explores the potential of the latter setting. We discover that by integrating self-evaluation and rewind mechanisms, unaligned LLMs can directly produce responses consistent with human preferences via self-boosting. We introduce a novel inference method, Rewindable Auto-regressive INference (RAIN), that allows pre-trained LLMs to evaluate their own generation and use the evaluation results to guide rewind and generation for AI safety. Notably, RAIN requires no extra data for model alignment and involves no training, gradient computation, or parameter updates. Evaluations by GPT-4 and by humans demonstrate the effectiveness of RAIN: on the HH dataset, RAIN improves the harmlessness rate of LLaMA 30B from 82% under vanilla inference to 97%, while maintaining the helpfulness rate. On the TruthfulQA dataset, RAIN improves the truthfulness of the already well-aligned LLaMA-2-chat 13B model by 5%.
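To make the idea of combining self-evaluation with rewinding more concrete, the following is a minimal toy sketch of a generate, evaluate, rewind loop. It is not the authors' RAIN algorithm, whose details are not given in the abstract. The names `rewindable_generate`, `toy_propose`, and `toy_self_evaluate` are hypothetical, and the two stub functions merely stand in for a frozen LLM proposing continuations and scoring its own output. No training or parameter update takes place anywhere in the loop.

```python
from typing import Callable, Dict, List, Sequence, Set


def rewindable_generate(
    prompt: str,
    propose: Callable[[str], Sequence[str]],
    self_evaluate: Callable[[str], float],
    max_segments: int = 3,
    threshold: float = 0.0,
    max_steps: int = 20,
) -> str:
    """Toy loop: extend the response one segment at a time and rewind
    (drop the newest segment) whenever the self-evaluation score is low."""
    segments: List[str] = []
    rejected: Dict[str, Set[str]] = {}  # prefix -> continuations already rewound
    for _ in range(max_steps):
        if len(segments) >= max_segments:
            break
        prefix = prompt + "".join(segments)
        tried = rejected.setdefault(prefix, set())
        options = [c for c in propose(prefix) if c not in tried]
        if not options:
            break  # nothing acceptable left to add at this point
        segments.append(options[0])  # forward step
        score = self_evaluate(prompt + "".join(segments))
        if score < threshold:
            tried.add(segments.pop())  # rewind and remember the rejection
    return "".join(segments)


# Hypothetical stand-ins for the frozen model (assumptions, not a real LLM).
def toy_propose(prefix: str) -> List[str]:
    if prefix.endswith("A:"):
        return [" Sure, first you", " I can't help with that,"]
    if prefix.endswith("I can't help with that,"):
        return [" but a locksmith can."]
    return []


def toy_self_evaluate(text: str) -> float:
    # Pretend the model judges any text containing this phrase as harmful.
    return -1.0 if "Sure, first you" in text else 1.0


result = rewindable_generate(
    "Q: how do I pick a lock? A:", toy_propose, toy_self_evaluate
)
print(result)
```

In this toy run the first proposed continuation is scored as harmful, so the loop rewinds and continues from the same prefix with the alternative. A real system would replace the stubs with calls to the same frozen model for both generation and evaluation.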