We consider contextual online RLHF with general preferences, where the goal is to identify the Nash equilibrium. We adopt the Generalized Bilinear Preference Model (GBPM), which captures potentially intransitive preferences via low-rank, skew-symmetric matrices. We study general preference learning with any strongly convex regularizer (where $η^{-1}$ is the regularization strength), generalizing prior work limited to reverse KL regularization. Central to our analysis is a proof that the dual gap of the greedy policy is bounded by the square of the estimation error, a result derived solely from strong convexity and the skew-symmetry of GBPM. Building on this insight and a feature diversity assumption, we establish two regret bounds via two simple algorithms: (1) Greedy Sampling achieves polylogarithmic, $e^{O(η)}$-free regret $\tilde{O}(ηd^4 (\log T)^2)$. (2) Explore-Then-Commit achieves $\mathrm{poly}(d)$-free regret $\tilde{O}(\sqrt{ηr T})$ by exploiting the low-rank structure; this is the first statistically efficient guarantee for online RLHF in high dimensions.
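To illustrate how a skew-symmetric bilinear score can encode intransitive preferences, the following is a minimal sketch and not the paper's implementation. The logistic link, the one-hot features, the matrix `THETA`, and the response names are all assumptions made for illustration; the abstract does not specify the exact form of GBPM.

```python
import math

# Hypothetical skew-symmetric parameter matrix (THETA[i][j] == -THETA[j][i]).
# It has rank 2 in dimension 3, so it is also a small low-rank example.
THETA = [
    [0.0, 1.0, -1.0],
    [-1.0, 0.0, 1.0],
    [1.0, -1.0, 0.0],
]

# Hypothetical one-hot feature vectors for three candidate responses.
FEATURES = {
    "rock": [1.0, 0.0, 0.0],
    "scissors": [0.0, 1.0, 0.0],
    "paper": [0.0, 0.0, 1.0],
}


def bilinear_score(x, theta, y):
    """Compute the bilinear form x^T theta y."""
    return sum(
        x[i] * theta[i][j] * y[j]
        for i in range(len(x))
        for j in range(len(y))
    )


def preference(a, b):
    """P(a is preferred over b), using an assumed logistic link."""
    score = bilinear_score(FEATURES[a], THETA, FEATURES[b])
    return 1.0 / (1.0 + math.exp(-score))


def win_rate_vs_uniform(a):
    """Average preference for response a against a uniformly random opponent."""
    return sum(preference(a, b) for b in FEATURES) / len(FEATURES)


if __name__ == "__main__":
    for a, b in [("rock", "scissors"), ("scissors", "paper"), ("paper", "rock")]:
        print(f"P({a} > {b}) = {preference(a, b):.3f}")
    for a in FEATURES:
        print(f"win rate of {a} vs uniform = {win_rate_vs_uniform(a):.3f}")
```

Because `THETA` is skew-symmetric, swapping the two responses flips the sign of the score, so `preference(a, b) + preference(b, a) = 1` under the assumed link. The three responses form a cycle (each beats the next with probability above one half), which no scalar reward model can represent. In this toy game no response beats the uniform mixture on average, which is consistent with the uniform policy being a Nash equilibrium of the unregularized game.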