Large language models (LLMs) still struggle to align with human preferences in complex tasks and scenarios. They are prone to overfitting to unexpected patterns or superficial styles in the training data. We conduct an empirical study in which only the top-10\% most updated parameters of an LLM are selected for alignment training, and observe improvements in both the convergence process and the final performance. This indicates the existence of redundant neurons in LLMs for alignment training. To reduce their influence, we propose a low-redundant alignment method named \textbf{ALLO}, which focuses on optimizing the most related neurons with the most useful supervised signals. Concretely, we first identify the neurons related to the human preference data using a gradient-based strategy, and then use reward models to identify the alignment-related key tokens on which the loss is computed. In addition, we decompose the alignment process into a forgetting stage and a learning stage: we first forget the tokens carrying unaligned knowledge and then learn aligned knowledge, updating a different ratio of neurons in each stage. Experimental results on 10 datasets demonstrate the effectiveness of ALLO. Our code and data are available at \url{https://github.com/RUCAIBox/ALLO}.
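The idea of training only a small, selected subset of parameters can be illustrated with a minimal sketch. It is not the ALLO implementation: it assumes that importance is scored by the absolute gradient of each scalar parameter, and the function names `top_fraction_mask` and `masked_sgd_step`, the learning rate, and the toy values are all hypothetical.

```python
def top_fraction_mask(grads, fraction=0.10):
    """Return a 0/1 mask keeping the `fraction` of entries with the largest |gradient|.

    Assumption: importance is scored by gradient magnitude; the released code
    may score neurons differently.
    """
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    k = max(1, round(len(grads) * fraction))  # keep at least one entry
    ranked = sorted(range(len(grads)), key=lambda i: abs(grads[i]), reverse=True)
    keep = set(ranked[:k])
    return [1 if i in keep else 0 for i in range(len(grads))]


def masked_sgd_step(params, grads, mask, lr=0.1):
    """Plain SGD step applied only where mask == 1; other entries stay unchanged."""
    return [p - lr * g * m for p, g, m in zip(params, grads, mask)]


# Toy example: 10 parameters, so the top 10% is a single entry.
params = [1.0] * 10
grads = [0.1, -0.9, 0.2, 0.05, 0.3, -0.4, 0.0, 0.6, -0.15, 0.25]
mask = top_fraction_mask(grads, fraction=0.10)
updated = masked_sgd_step(params, grads, mask)
```

In this toy run only the parameter with the largest gradient magnitude (index 1) is updated, and the remaining nine keep their original values. Using a different `fraction` in each stage would mirror, under the same assumptions, the idea of updating different ratios of neurons for forgetting and learning.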