Existing alignment methods share a common topology of information flow: reward information is collected from humans, modeled with preference learning, and used to tune language models. However, this shared topology has not been systematically characterized, nor have its alternatives been thoroughly explored, leaving the problems of low data efficiency and unreliable generalization unaddressed. To address this gap, we introduce a theoretical framework for investigating reward generalization in reinforcement learning from human feedback (RLHF), focusing on the topology of information flow at both the macro and micro levels. At the macro level, we portray the RLHF information flow as an autoencoding process over behavior distributions, formalizing the RLHF objective of distributional consistency between human preference and model behavior. At the micro level, we present induced Bayesian networks as a theory of reward generalization in RLHF, introducing fine-grained dataset topologies into generalization bounds. Combining analyses at both levels, we propose reward modeling from tree-structured preference information. It is shown to reduce reward uncertainty by a factor of up to $\Theta(\log n/\log\log n)$ compared to baselines, where $n$ is the dataset size. Validation on three NLP tasks shows that our tree-based reward model achieves an average win rate of 65% against baseline methods, improving reward generalization for free through topology design.
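To make the micro-level idea concrete, the sketch below builds candidate preference comparisons from responses organized as a tree (sibling branches sharing a common prefix are compared against each other) rather than as independent response pairs. This is a minimal illustrative sketch only: the tree layout, the sibling-pairing rule, and all names (`sibling_preference_pairs`, `resp_A`, etc.) are hypothetical assumptions for exposition, not the exact construction used in the paper.

```python
# Illustrative sketch (assumptions, not the paper's construction): responses to a
# prompt are organized as a tree whose edges denote shared prefixes, and candidate
# preference comparisons are drawn between sibling branches.
from itertools import combinations
from typing import Dict, List, Tuple


def sibling_preference_pairs(tree: Dict[str, List[str]], root: str) -> List[Tuple[str, str]]:
    """Collect (a, b) sibling pairs from a response tree; each pair is a
    candidate comparison to be labeled with a human preference."""
    pairs: List[Tuple[str, str]] = []
    stack = [root]
    while stack:
        node = stack.pop()
        children = tree.get(node, [])
        pairs.extend(combinations(children, 2))  # compare branches that share a prefix
        stack.extend(children)
    return pairs


# Toy example: one prompt whose responses branch twice (a depth-2 tree).
tree = {
    "prompt": ["resp_A", "resp_B"],
    "resp_A": ["resp_A1", "resp_A2"],
    "resp_B": ["resp_B1", "resp_B2"],
}
print(sibling_preference_pairs(tree, "prompt"))
# -> [('resp_A', 'resp_B'), ('resp_B1', 'resp_B2'), ('resp_A1', 'resp_A2')]
```

Under these assumptions, a single prompt yields structurally related comparisons along the tree instead of isolated pairs, giving a concrete example of the kind of fine-grained dataset topology the abstract refers to.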