We consider the problem of regularized best-response max-regret minimization in online RLHF under general preferences and bandit feedback. Although a variety of regularizers are used to make alignment more robust, the known polylogarithmic regret guarantees remain largely specific to KL regularization. To investigate whether such fast rates extend beyond KL, we adopt the Generalized Bilinear Preference Model (GBPM), which captures intransitive preferences over $d$-dimensional item-wise features via a rank-$2r$ skew-symmetric matrix, to isolate the impact of generic regularization. Crucially, under GBPM, we prove that the dual gap of any greedy policy is bounded by the squared estimation error, a result derived using \emph{only} strong convexity and skew-symmetry. Under a feature coverage assumption, we establish a \emph{generic} polylogarithmic regret of $\tilde{\mathcal{O}}(\eta d^4 C_{\min}^{-1} (\log T)^2 \wedge d^2 C_{\min}^{-1/2} \sqrt{T})$ with Greedy Sampling, and a regret with improved dimension dependence (for well-conditioned arm-sets) of $\tilde{\mathcal{O}}(C_{\min}^{-2} \sqrt{\eta r T} \wedge r^{1/3} C_{\min}^{-4/3} T^{2/3})$ with Explore-Then-Commit, where $\eta^{-1}$ is the regularization coefficient, $T$ is the time horizon, and $C_{\min}$ is an arm-set dependent quantity. This demonstrates that ``fast'' regret rates are not KL-specific, but rather a fundamental consequence of generic strongly convex geometry.
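To make the role of skew-symmetry concrete, the following minimal sketch builds a skew-symmetric matrix of rank at most $2r$ and evaluates a bilinear preference probability. Everything named here is an illustrative assumption rather than the paper's construction: the factorization $\Theta = UV^\top - VU^\top$, the logistic link, and the helper names `make_skew_symmetric` and `preference_prob` are hypothetical, and the abstract does not specify the link function used in GBPM.

```python
import numpy as np


def make_skew_symmetric(d, r, rng):
    """Hypothetical helper: Theta = U V^T - V U^T is skew-symmetric with rank <= 2r."""
    U = rng.standard_normal((d, r))
    V = rng.standard_normal((d, r))
    return U @ V.T - V @ U.T


def preference_prob(theta, x, y):
    """Assumed logistic link: P(x beats y) = sigmoid(x^T Theta y)."""
    score = x @ theta @ y
    return 1.0 / (1.0 + np.exp(-score))


rng = np.random.default_rng(0)
d, r = 6, 2
theta = make_skew_symmetric(d, r, rng)

# Two items, each described by a d-dimensional feature vector.
x = rng.standard_normal(d)
y = rng.standard_normal(d)

p_xy = preference_prob(theta, x, y)
p_yx = preference_prob(theta, y, x)

# Skew-symmetry gives x^T Theta y = -(y^T Theta x), so the two
# probabilities are complementary and an item ties with itself.
print("P(x > y) + P(y > x) =", p_xy + p_yx)
print("P(x > x) =", preference_prob(theta, x, x))
```

Because the score is an antisymmetric bilinear form rather than a difference of per-item utilities, nothing forces the induced preferences to be transitive, which is the property the abstract attributes to GBPM.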