Alignment with human preference is a desired property of large language models (LLMs). Currently, the main alignment approach is based on reinforcement learning from human feedback (RLHF). Despite the effectiveness of RLHF, it is intricate to implement and train, thus recent studies explore how to develop alternative alignment approaches based on supervised fine-tuning (SFT). A major limitation of SFT is that it essentially does imitation learning, which cannot fully understand what are the expected behaviors. To address this issue, we propose an improved alignment approach named FIGA. Different from prior methods, we incorporate fine-grained (i.e., token or phrase level) quality signals that are derived by contrasting good and bad responses. Our approach has made two major contributions. Firstly, we curate a refined alignment dataset that pairs initial responses and the corresponding revised ones. Secondly, we devise a new loss function can leverage fine-grained quality signals to instruct the learning of LLMs for alignment. Extensive experiments have demonstrated the effectiveness of our approaches by comparing a number of competitive baselines.
翻译:与人类偏好对齐是大语言模型(LLMs)期望的特性。目前,主要的对齐方法基于从人类反馈中进行强化学习(RLHF)。尽管RLHF有效,但其实现和训练过程复杂,因此近年来研究探索基于监督微调(SFT)的替代对齐方法。SFT的主要局限在于它本质上进行模仿学习,无法完全理解预期行为。为解决此问题,我们提出了一种改进的对齐方法,名为FIGA。与先前方法不同,我们整合了通过对比优劣响应派生出的细粒度(如词元或短语级别)质量信号。本研究有两项主要贡献:首先,我们构建了一个精细对齐数据集,将初始响应与对应的修订版本配对;其次,我们设计了一种新的损失函数,能够利用细粒度质量信号指导LLMs的对齐学习。通过对比多个竞争基线模型,大量实验证明了我们方法的有效性。