Rethinking Information Structures in RLHF: Reward Generalization from a Graph Theory Perspective

There is a trilemma in reinforcement learning from human feedback (RLHF): highly diverse contexts, low labeling cost, and reliable alignment performance cannot all be achieved together. We mitigate this incompatibility through the design of dataset information structures during reward modeling, and introduce the Induced Bayesian Network (IBN), the first theory of reward generalization capable of generating substantial verified predictions on large language models (LLMs). Specifically, we first reexamine the RLHF process and propose a theoretical framework that portrays it as an autoencoding process over text distributions. This framework formalizes the RLHF objective of ensuring distributional consistency between human preference and LLM behavior. Based on this framework, we then introduce the IBN to analyze generalization in the reward modeling stage of RLHF. Drawing on random graph theory and causal analysis, the IBN enables empirically grounded derivation of generalization error bounds, a key improvement over classical theories of generalization. Finally, a key insight from our analysis is the superiority of the tree-based information structure in reward modeling over the chain-based baselines used in conventional RLHF methods. With the IBN, we derive that in complex contexts with limited data, the tree-based reward model (RM), trained on a tree-structured preference dataset, induces up to $\Theta(\log n/\log\log n)$ times less variance than the chain-based baseline, where $n$ is the dataset size. As validation, we demonstrate that on three NLP tasks, the tree-based RM achieves a 65% win rate on average against chain-based baselines. This result shows that alignment performance can be gained for free through the design of the dataset information structure, without any other changes.
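
To give a feel for the scale of the $\Theta(\log n/\log\log n)$ factor, the following minimal sketch evaluates the growth term $\log n/\log\log n$ for a few dataset sizes. It assumes natural logarithms and sets the constant hidden by $\Theta$ to 1; both are illustrative assumptions, and the function name `variance_reduction_factor` is hypothetical. The printed values are not results from the paper, and the bound is stated as an "up to" factor, so they indicate only how slowly the term grows with $n$.

```python
import math


def variance_reduction_factor(n: int) -> float:
    """Growth term log(n) / log(log(n)).

    Assumption: natural logarithms, and the constant hidden by Theta is 1.
    """
    if n < 3:
        # log(log(n)) is positive only when n > e, i.e. n >= 3 for integers.
        raise ValueError("n must be at least 3 so that log(log(n)) is positive")
    return math.log(n) / math.log(math.log(n))


# Evaluate the term for a few hypothetical dataset sizes.
for n in (10**3, 10**4, 10**5, 10**6):
    print(f"n = {n:>9,d}: log n / log log n = {variance_reduction_factor(n):.2f}")
```

The term grows very slowly: raising $n$ from a thousand to a million increases it by well under a factor of two, so under these assumptions the predicted advantage is a modest multiplicative factor that widens gradually with dataset size.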
