Large language models (LLMs) often produce outputs that are inconsistent with human preferences. Previous research typically gathered human preference data and then aligned pre-trained models through reinforcement learning or instruction tuning, a step known as finetuning. In contrast, aligning frozen LLMs without any extra data is more appealing. This work explores the potential of the latter setting. We discover that by integrating self-evaluation and rewind mechanisms, unaligned LLMs can directly produce responses consistent with human preferences via self-boosting. We introduce a novel inference method, Rewindable Auto-regressive INference (RAIN), that allows pre-trained LLMs to evaluate their own generation and use the evaluation results to guide backward rewind and forward generation for AI safety. Notably, RAIN requires no extra data for model alignment and involves no training, gradient computation, or parameter updates; during the self-evaluation phase, the model is told which human preference to align with through a fixed-template prompt, so the initial prompt does not need to be modified. Experimental results evaluated by GPT-4 and humans demonstrate the effectiveness of RAIN: on the HH dataset, RAIN improves the harmlessness rate of LLaMA 30B over vanilla inference from 82% to 97%, while maintaining the helpfulness rate. Under the leading adversarial attack llm-attacks on Vicuna 33B, RAIN establishes a new defense baseline by reducing the attack success rate from 94% to 19%.
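To make the idea of combining self-evaluation with rewinding concrete, the following is a minimal toy sketch and not the authors' algorithm, whose details this abstract does not give. It assumes a greedy loop in which a response is extended one segment at a time, each extension is scored, and a low-scoring segment is rewound and replaced. All names here (`rewindable_generate`, `toy_propose`, `toy_self_evaluate`, `threshold`, `max_rewinds`) are hypothetical; the two toy functions stand in for sampling from a frozen LLM and for the model scoring its own output via a fixed-template prompt.

```python
from typing import Callable, Dict, List, Sequence, Set, Tuple


def rewindable_generate(
    propose: Callable[[Sequence[str]], List[str]],
    self_evaluate: Callable[[Sequence[str]], float],
    max_segments: int = 4,
    threshold: float = 0.5,
    max_rewinds: int = 8,
) -> List[str]:
    """Extend a response segment by segment; rewind segments that score poorly.

    Simplified illustration only: no parameters are updated, and the only
    feedback signal is the score returned by `self_evaluate`.
    """
    response: List[str] = []
    # Segments already tried and rejected after a given prefix.
    rejected: Dict[Tuple[str, ...], Set[str]] = {}
    rewinds = 0
    while len(response) < max_segments:
        prefix = tuple(response)
        tried = rejected.get(prefix, set())
        candidates = [c for c in propose(response) if c not in tried]
        if not candidates:
            break  # nothing (new) to append after this prefix
        candidate = candidates[0]
        response.append(candidate)  # forward generation
        if self_evaluate(response) < threshold and rewinds < max_rewinds:
            response.pop()  # backward rewind
            rejected.setdefault(prefix, set()).add(candidate)
            rewinds += 1
    return response


# Hypothetical stand-in for a frozen LLM: continuations keyed by prefix length.
CONTINUATIONS = {
    0: ["Here is how to cause harm.", "I can't help with that."],
    1: ["Consider a safer alternative."],
}


def toy_propose(prefix: Sequence[str]) -> List[str]:
    return CONTINUATIONS.get(len(prefix), [])


def toy_self_evaluate(response: Sequence[str]) -> float:
    # Hypothetical stand-in for self-evaluation with a fixed-template prompt.
    return 0.0 if any("harm" in segment for segment in response) else 1.0


result = rewindable_generate(toy_propose, toy_self_evaluate)
print(result)
```

In this toy run the first proposed segment scores below the threshold, so it is rewound and the next candidate is kept instead. Once `max_rewinds` is exhausted the loop accepts whatever is proposed, a simplification chosen to guarantee termination.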