Alignment training is crucial for enabling large language models (LLMs) to cater to human intentions and preferences. It is typically performed in two stages with different objectives: instruction-following alignment and human-preference alignment. However, aligning LLMs with these objectives in sequence suffers from an inherent problem: the objectives may conflict, and there is no guarantee that the resulting LLM aligns well with both the instructions and human preferences simultaneously. To address this, we propose a Hybrid Alignment Training (Hbat) approach, based on alternating alignment and modified elastic weight consolidation methods. The basic idea is to alternate between the different objectives during alignment training, so that the two alignment tasks collaborate better. We experiment with Hbat on summarization and dialogue tasks. Experimental results show that Hbat significantly outperforms all baselines. Notably, Hbat yields consistent performance gains over traditional two-stage alignment training with both proximal policy optimization and direct preference optimization.
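To make the idea of alternating between objectives concrete, the following is a minimal toy sketch and not the Hbat implementation. It assumes two quadratic losses as stand-ins for the instruction-following and human-preference objectives, and it uses a standard elastic-weight-consolidation penalty with uniform importance weights, since the abstract does not specify the modification applied to elastic weight consolidation. All function names, targets, and hyperparameters are hypothetical.

```python
# Toy illustration only: two quadratic objectives stand in for the
# instruction-following and human-preference losses. Nothing here is
# taken from the Hbat implementation.

def loss_quadratic(theta, target):
    """0.5 * ||theta - target||^2, a stand-in for one alignment loss."""
    return 0.5 * sum((t - c) ** 2 for t, c in zip(theta, target))


def grad_quadratic(theta, target):
    """Gradient of loss_quadratic with respect to theta."""
    return [t - c for t, c in zip(theta, target)]


def train_phase(theta, target, anchor=None, importance=None,
                lam=0.0, lr=0.1, steps=50):
    """Gradient descent on one objective.

    If `anchor` is given, adds the gradient of the standard EWC penalty
    (lam / 2) * sum_i importance_i * (theta_i - anchor_i)^2.
    """
    theta = list(theta)
    for _ in range(steps):
        g = grad_quadratic(theta, target)
        if anchor is not None:
            g = [gi + lam * f * (t - a)
                 for gi, f, t, a in zip(g, importance, theta, anchor)]
        theta = [t - lr * gi for t, gi in zip(theta, g)]
    return theta


def two_stage(theta0, target_a, target_b):
    """Sequential baseline: optimize objective A, then objective B."""
    theta = train_phase(theta0, target_a)
    return train_phase(theta, target_b)


def alternating(theta0, target_a, target_b, rounds=5, lam=1.0):
    """Alternate between A and B, anchoring each phase to the
    parameters reached at the end of the previous phase."""
    theta = list(theta0)
    # Assumption: uniform importance. EWC normally estimates these
    # weights from the diagonal of the Fisher information matrix.
    importance = [1.0] * len(theta)
    for _ in range(rounds):
        anchor = list(theta)
        theta = train_phase(theta, target_a, anchor, importance, lam)
        anchor = list(theta)
        theta = train_phase(theta, target_b, anchor, importance, lam)
    return theta


if __name__ == "__main__":
    theta0 = [0.0, 0.0]
    target_a = [1.0, 0.0]  # hypothetical optimum of objective A
    target_b = [0.0, 1.0]  # hypothetical optimum of objective B (conflicts with A)
    for name, fn in (("two-stage", two_stage), ("alternating", alternating)):
        theta = fn(theta0, target_a, target_b)
        print(name,
              "loss A:", round(loss_quadratic(theta, target_a), 3),
              "loss B:", round(loss_quadratic(theta, target_b), 3))
```

In this toy setting the sequential run ends near the optimum of the second objective and largely forgets the first, whereas the alternating run settles on a compromise that keeps a lower loss on the first objective. The sketch says nothing about how large the effect is for actual LLM alignment.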