Optimistic exploration is central to improving sample efficiency in reinforcement learning from human feedback (RLHF), yet existing exploratory bonus methods designed to incentivize exploration often fail to realize optimism. We provide a theoretical analysis showing that current formulations, under KL or $\alpha$-divergence regularization, unintentionally bias exploration toward high-probability regions of the reference model, thereby reinforcing conservative behavior instead of promoting discovery of uncertain regions. To address this pitfall, we introduce the General Exploratory Bonus (GEB), a novel theoretical framework that provably satisfies the optimism principle. GEB counteracts divergence-induced bias via reference-dependent reward regulation and unifies prior heuristic bonuses as special cases, while extending naturally across the full $\alpha$-divergence family. Empirically, GEB consistently outperforms baselines on alignment tasks across multiple divergence settings and large language model backbones. These results demonstrate that GEB offers a principled and practical solution for optimistic exploration in RLHF.
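As a minimal sketch of the bias the analysis refers to (the notation $r(x,y)$ for the learned reward, $b(x,y)$ for an exploratory bonus, $\beta$ for the regularization strength, and $\pi_{\mathrm{ref}}$ for the reference model is introduced here for illustration; the specific GEB construction is not reproduced from the abstract), consider the standard KL-regularized objective augmented with a bonus,
\[
\max_{\pi}\ \mathbb{E}_{y \sim \pi(\cdot \mid x)}\!\left[r(x,y) + b(x,y)\right] \;-\; \beta\,\mathrm{KL}\!\left(\pi(\cdot \mid x)\,\|\,\pi_{\mathrm{ref}}(\cdot \mid x)\right),
\]
whose well-known maximizer is
\[
\pi^{*}(y \mid x) \;\propto\; \pi_{\mathrm{ref}}(y \mid x)\,\exp\!\left(\tfrac{1}{\beta}\bigl(r(x,y) + b(x,y)\bigr)\right).
\]
Any bonus that is independent of $\pi_{\mathrm{ref}}$ remains weighted multiplicatively by the reference probability, so exploration concentrates where $\pi_{\mathrm{ref}}$ is already high; a reference-dependent regulation of the reward, as GEB is described above, is what would offset this factor.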