The alignment tuning process of large language models (LLMs) typically involves instruction learning through supervised fine-tuning (SFT) and preference tuning via reinforcement learning from human feedback (RLHF). A recent study, LIMA (Zhou et al., 2023), shows that using merely 1K examples for SFT can also achieve strong alignment performance, suggesting that the effect of alignment tuning might be "superficial." This raises the question of how exactly alignment tuning transforms a base LLM. We analyze the effect of alignment tuning by examining the token distribution shift between base LLMs and their aligned counterparts. Our findings reveal that base LLMs and their alignment-tuned versions perform nearly identically in decoding on the majority of token positions, and that most distribution shifts occur with stylistic tokens. This direct evidence strongly supports the Superficial Alignment Hypothesis suggested by LIMA. Based on these findings, we rethink the alignment of LLMs by posing the research question: how effectively can we align base LLMs without SFT or RLHF? To address this, we introduce a simple, tuning-free alignment method, URIAL. URIAL achieves effective alignment purely through in-context learning (ICL) with base LLMs, requiring as few as three constant stylistic examples and a system prompt. We conduct a fine-grained and interpretable evaluation on a diverse set of examples, named JUST-EVAL-INSTRUCT. Results demonstrate that base LLMs with URIAL can match or even surpass the performance of LLMs aligned with SFT or SFT+RLHF. We show that the gap between tuning-free and tuning-based alignment methods can be significantly reduced through strategic prompting and ICL. Our findings on the superficial nature of alignment tuning, together with the results with URIAL, suggest that deeper analysis and theoretical understanding of alignment are crucial to future LLM research.
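The token distribution shift analysis described above can be illustrated with a small sketch. It is not the authors' implementation: the function names, the three category labels, the rank threshold of 3, and the toy probabilities are all assumptions made for illustration. The idea is that both models are given the same context at each position; we take the aligned model's top token and check where it ranks in the base model's next-token distribution.

```python
from typing import Dict, List

# A next-token distribution at one position: token -> probability.
Distribution = Dict[str, float]


def rank_in_base(token: str, base_probs: Distribution) -> int:
    """Return the 1-based rank of `token` in the base distribution.

    Tokens absent from `base_probs` are ranked after every listed token.
    """
    ordered = sorted(base_probs, key=base_probs.get, reverse=True)
    return ordered.index(token) + 1 if token in base_probs else len(ordered) + 1


def classify_position(base_probs: Distribution,
                      aligned_probs: Distribution,
                      marginal_max_rank: int = 3) -> str:
    """Label one position by how the base model ranks the aligned model's top token.

    The threshold `marginal_max_rank` is an illustrative assumption.
    """
    aligned_top = max(aligned_probs, key=aligned_probs.get)
    rank = rank_in_base(aligned_top, base_probs)
    if rank == 1:
        return "unshifted"
    if rank <= marginal_max_rank:
        return "marginal"
    return "shifted"


def shift_summary(base_seq: List[Distribution],
                  aligned_seq: List[Distribution]) -> Dict[str, float]:
    """Return the fraction of positions that fall into each category."""
    if len(base_seq) != len(aligned_seq):
        raise ValueError("both sequences must cover the same positions")
    counts = {"unshifted": 0, "marginal": 0, "shifted": 0}
    for base_probs, aligned_probs in zip(base_seq, aligned_seq):
        counts[classify_position(base_probs, aligned_probs)] += 1
    total = len(base_seq)
    return {k: (v / total if total else 0.0) for k, v in counts.items()}


# Toy data (made up): four positions, same context given to both models.
base = [
    {"The": 0.5, "A": 0.2, "It": 0.15, "Paris": 0.1, "Sure": 0.05},
    {"capital": 0.7, "city": 0.2, "answer": 0.1},
    {"is": 0.9, "was": 0.1},
    {"Paris": 0.8, "London": 0.2},
]
aligned = [
    {"Sure": 0.7, "The": 0.2, "A": 0.1},   # stylistic opener preferred
    {"capital": 0.8, "city": 0.2},
    {"is": 0.95, "was": 0.05},
    {"Paris": 0.9, "London": 0.1},
]

summary = shift_summary(base, aligned)
print(summary)
```

In this toy case only the first position, a stylistic opener, counts as shifted, while the content-bearing positions agree between the two models. With real models the distributions would come from the models' output logits rather than hand-written dictionaries.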