Recent advances in Reinforcement Learning from Human Feedback (RLHF) have shown that KL-regularization plays a pivotal role in improving the efficiency of RL fine-tuning for large language models (LLMs). Despite its empirical advantage, the theoretical difference between KL-regularized RL and standard RL remains largely under-explored. While there is a recent line of work on the theoretical analysis of the KL-regularized objective in decision making \citep{xiong2024iterative, xie2024exploratory, zhao2024sharp}, these analyses either reduce to the traditional RL setting or rely on strong coverage assumptions. In this paper, we propose an optimism-based KL-regularized online contextual bandit algorithm and provide a novel analysis of its regret. By carefully leveraging the benign optimization landscape induced by the KL-regularization and the optimistic reward estimation, our algorithm achieves an $\mathcal{O}\big(\eta\log (N_{\mathcal R} T)\cdot d_{\mathcal R}\big)$ logarithmic regret bound, where $\eta, N_{\mathcal R}, T, d_{\mathcal R}$ denote the KL-regularization parameter, the cardinality of the reward function class, the number of rounds, and the complexity of the reward function class, respectively. Furthermore, we extend our algorithm and analysis to reinforcement learning by developing a novel decomposition over transition steps, and we obtain a similar logarithmic regret bound.
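For concreteness, the following is a minimal sketch of the KL-regularized contextual bandit objective typically studied in this line of work; the exact form, and notation such as the context distribution $\rho$ and the reference policy $\pi_0$, are assumptions for illustration and are not fixed by the abstract:
\[
\max_{\pi}\;\; \mathbb{E}_{x\sim\rho,\; a\sim\pi(\cdot\mid x)}\big[r(x,a)\big]\;-\;\frac{1}{\eta}\,\mathbb{E}_{x\sim\rho}\Big[\mathrm{KL}\big(\pi(\cdot\mid x)\,\big\|\,\pi_0(\cdot\mid x)\big)\Big],
\]
where $r$ is the reward function and $\eta>0$ is the KL-regularization parameter appearing in the regret bound above; larger $\eta$ corresponds to weaker regularization.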