Accurately aligning large language models (LLMs) with human preferences is crucial for informing fair, economically sound, and statistically efficient decision-making processes. However, we argue that reinforcement learning from human feedback (RLHF) -- the predominant approach for aligning LLMs with human preferences through a reward model -- suffers from an inherent algorithmic bias due to its Kullback--Leibler-based regularization in optimization. In extreme cases, this bias could lead to a phenomenon we term preference collapse, where minority preferences are virtually disregarded. To mitigate this algorithmic bias, we introduce preference matching (PM) RLHF, a novel approach that provably aligns LLMs with the preference distribution of the reward model under the Bradley--Terry--Luce/Plackett--Luce model. Central to our approach is a PM regularizer that takes the form of the negative logarithm of the LLM's policy probability distribution over responses, which helps the LLM balance response diversification and reward maximization. Notably, we obtain this regularizer by solving an ordinary differential equation that is necessary for the PM property. For practical implementation, we introduce a conditional variant of PM RLHF that is tailored to natural language generation. Finally, we empirically validate the effectiveness of conditional PM RLHF through experiments on the OPT-1.3B and Llama-2-7B models, demonstrating a 29% to 41% improvement in alignment with human preferences, as measured by a certain metric, compared to standard RLHF.
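To make the preference-matching property concrete, the following is a minimal sketch under the Bradley--Terry--Luce model with a discrete set of candidate responses; the notation is illustrative and omits the conditional variant used in practice. With a reward model $r$, the entropy-form regularizer described above leads to the objective
\[
\max_{\pi}\ \mathbb{E}_{y \sim \pi(\cdot \mid x)}\bigl[\, r(x, y) - \log \pi(y \mid x) \,\bigr],
\]
whose maximizer is the softmax policy
\[
\pi^{*}(y \mid x) \;=\; \frac{\exp\bigl(r(x, y)\bigr)}{\sum_{y'} \exp\bigl(r(x, y')\bigr)},
\]
that is, exactly the Bradley--Terry--Luce preference distribution induced by the reward model. By contrast, the standard KL-regularized RLHF objective $\mathbb{E}_{y \sim \pi(\cdot \mid x)}[r(x, y)] - \beta\, \mathrm{KL}\bigl(\pi(\cdot \mid x) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid x)\bigr)$ is maximized by $\pi(y \mid x) \propto \pi_{\mathrm{ref}}(y \mid x) \exp\bigl(r(x, y)/\beta\bigr)$, which increasingly over-weights high-reward responses as $\beta$ shrinks and can drive minority-preferred responses toward vanishing probability.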