Reverse-Kullback-Leibler (KL) regularization has emerged as a predominant technique for enhancing policy optimization in reinforcement learning (RL) and reinforcement learning from human feedback (RLHF), where it forces the learned policy to stay close to a reference policy. While the effectiveness and necessity of KL-regularization have been empirically demonstrated in various practical scenarios, current theoretical analyses of KL-regularized RLHF still obtain the same $\mathcal{O}(1 / \epsilon^2)$ sample complexity as problems without KL-regularization. To understand the fundamental distinction between policy learning objectives with and without KL-regularization, we are the first to theoretically demonstrate the power of KL-regularization by providing a sharp analysis for KL-regularized contextual bandits and RLHF, revealing an $\mathcal{O}(1 / \epsilon)$ sample complexity when $\epsilon$ is sufficiently small. We further explore the role of data coverage in contextual bandits and RLHF. While the coverage assumption is commonly employed in offline RLHF to link the samples from the reference policy to the optimal policy, often at the cost of a multiplicative dependence on the coverage coefficient, its impact on the sample complexity of online RLHF remains unclear. Previous theoretical analyses of online RLHF typically require explicit exploration and additional structural assumptions on the reward function class. In contrast, we show that with sufficient coverage from the reference policy, a simple two-stage mixed sampling strategy can achieve a sample complexity with only an additive dependence on the coverage coefficient. Our results provide a comprehensive understanding of the roles of KL-regularization and data coverage in RLHF, shedding light on the design of more efficient RLHF algorithms.