Reinforcement learning from human feedback (RLHF) faces a trilemma: highly diverse contexts, low labeling cost, and reliable alignment performance are mutually incompatible. We aim to mitigate this incompatibility through the design of dataset information structures during reward modeling, and we also propose new methods of analysis with wider applications, potentially including shedding light on goal misgeneralization. Specifically, we first reexamine the RLHF process and propose a theoretical framework that portrays it as an autoencoding process over text distributions. The framework formalizes the RLHF objective of ensuring distributional consistency between human preference and large language model (LLM) behavior. Under this framework, we introduce a new method based on random graph theory, the induced Bayesian network (IBN), which models generalization in the semantic space and enables empirically grounded analysis of generalization error bounds, aiming to shed light on reward generalization in RLHF. One insight from our analysis is the superiority of the tree-based information structure in reward modeling over the chain-based baselines of conventional RLHF methods. We derive that, in complex contexts with limited data, the tree-based reward model (RM) induces up to $\Theta(\log n/\log\log n)$ times less variance than the chain-based RM, where $n$ is the dataset size. As validation, we demonstrate that on three NLP tasks the tree-based RM achieves a 65% win rate on average against chain-based baselines.
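To give a rough sense of how slowly the factor $\log n/\log\log n$ grows with the dataset size $n$, the following minimal sketch evaluates it for a few values of $n$. The function name `variance_ratio_scale` is hypothetical, natural logarithms are an assumption, and the $\Theta$ notation hides constant factors, so the printed numbers indicate only the asymptotic scale and are not predicted variance ratios.

```python
import math


def variance_ratio_scale(n: int) -> float:
    """Return log(n) / log(log(n)), the growth rate inside the Theta bound.

    Assumptions (not from the source): natural logarithms are used, and the
    constant factors hidden by Theta are ignored, so the result is only an
    order-of-magnitude scale.
    """
    if n < 3:
        # For n < 3, log(log(n)) is zero or negative and the ratio is meaningless.
        raise ValueError("n must be at least 3 so that log(log(n)) > 0")
    return math.log(n) / math.log(math.log(n))


if __name__ == "__main__":
    # Illustrative dataset sizes; these values are arbitrary.
    for n in (10**3, 10**4, 10**5, 10**6):
        print(f"n = {n:>9,d}   log n / log log n = {variance_ratio_scale(n):.2f}")
```

Over this range the factor stays in the single digits, which suggests that any advantage of this form would be moderate at practical dataset sizes, subject to the hidden constants.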