Rethinking Information Structures in RLHF: Reward Generalization from a Graph Theory Perspective

There is a trilemma in reinforcement learning from human feedback (RLHF): highly diverse contexts, low labeling cost, and reliable alignment performance are mutually incompatible. We aim to mitigate this incompatibility by designing the information structure of the dataset used in reward modeling, and we propose new, generalizable methods of analysis with wider applications, potentially including goal misgeneralization. Specifically, we first reexamine the RLHF process and propose a theoretical framework that portrays it as an autoencoding process over text distributions. The framework formalizes the RLHF objective of ensuring distributional consistency between human preference and large language model (LLM) behavior. Based on this framework, we introduce the induced Bayesian network (IBN), a new method for modeling generalization in the reward modeling stage of RLHF. Drawing on random graph theory and causal analysis, the IBN enables empirically grounded derivation of generalization error bounds, a key improvement over classical methods of generalization analysis. One insight from our analysis is that a tree-based information structure is superior in reward modeling to the chain-based baselines used in conventional RLHF methods. We derive that in complex contexts with limited data, the tree-based reward model (RM) induces up to $\Theta(\log n/\log\log n)$ times less variance than the chain-based RM, where $n$ is the dataset size. As validation, we demonstrate that on three NLP tasks, the tree-based RM achieves a 65% win rate on average against chain-based baselines. Looking ahead, we hope to extend the IBN analysis to help understand the phenomenon of goal misgeneralization.
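
To give a rough sense of how the $\Theta(\log n/\log\log n)$ factor grows with the dataset size, the following minimal sketch evaluates $\log n/\log\log n$ for a few values of $n$. It is an illustration only: the function name `variance_ratio_scale` is hypothetical, the natural logarithm is assumed, and the constant factors hidden by the $\Theta$ notation are ignored, so the printed numbers indicate the growth rate rather than the actual variance reduction.

```python
import math


def variance_ratio_scale(n: int) -> float:
    """Return log(n) / log(log(n)) using natural logarithms.

    This is only the growth term inside the Theta bound; constant
    factors are unknown and ignored here (an assumption of this sketch).
    """
    if n < 3:
        # For n < 3, log(log(n)) is non-positive or undefined.
        raise ValueError("n must be at least 3 so that log(log(n)) > 0")
    return math.log(n) / math.log(math.log(n))


if __name__ == "__main__":
    # Hypothetical dataset sizes chosen purely for illustration.
    for n in (10**3, 10**4, 10**5, 10**6):
        print(f"n = {n:>9,d}   log n / log log n = {variance_ratio_scale(n):.2f}")
```

The factor grows very slowly: increasing $n$ by three orders of magnitude raises it by less than a factor of two, which fits the abstract's framing of the advantage as mattering most when data is limited.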
