Post-training alignment has increasingly become a crucial factor in enhancing the usability of language models (LMs). However, the preferred strength of alignment varies with individual user preferences. This paper proposes a method, referred to as CLM, that incorporates alignment control into a single model. The approach adds an identity layer before the model's initial layers and performs preference learning only on this layer, mapping unaligned input token embeddings into the aligned space. Experimental results demonstrate that this efficient fine-tuning method performs comparably to full fine-tuning. During inference, the input embeddings are processed through both the aligned and the unaligned layers, and the two representations are merged via an interpolation coefficient. By adjusting this coefficient, the model exhibits a clear alignment interpolation and extrapolation phenomenon.
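As a sketch of the inference-time merging described above (the exact formulation here is an assumption, not taken from the paper): if the learned layer $g(\cdot)$ maps an unaligned token embedding $x$ to its aligned counterpart, the merged embedding could plausibly be a linear interpolation governed by a coefficient $\lambda$,
$$\tilde{x} = (1 - \lambda)\, x + \lambda\, g(x),$$
where $\lambda = 0$ would recover the unaligned model, $\lambda = 1$ the fully aligned one, and $\lambda > 1$ would extrapolate beyond the learned alignment strength, which is consistent with the interpolation and extrapolation behavior the abstract reports.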