We study online preference-based reinforcement learning (PbRL) with the goal of improving sample efficiency. A growing body of theoretical work has emerged, motivated by PbRL's recent empirical success, particularly in aligning large language models (LLMs); however, most existing studies focus only on pairwise comparisons. A few recent works (Zhu et al., 2023; Mukherjee et al., 2024; Thekumparampil et al., 2024) have explored multiple comparisons and ranking feedback, but their performance guarantees fail to improve, and can even deteriorate, as the feedback length increases, despite the richer information available. To address this gap, we adopt the Plackett-Luce (PL) model for ranking feedback over action subsets and propose M-AUPO, an algorithm that selects multiple actions by maximizing the average uncertainty within the offered subset. We prove that M-AUPO achieves a suboptimality gap of $\tilde{\mathcal{O}}\left( \frac{d}{T} \sqrt{ \sum_{t=1}^T \frac{1}{|S_t|}} \right)$, where $T$ is the total number of rounds, $d$ is the feature dimension, and $|S_t|$ is the size of the subset offered at round $t$. This result shows that larger subsets directly lead to improved performance; notably, the bound avoids the exponential dependence on the norm of the unknown parameter, which was a fundamental limitation in most previous works. Moreover, we establish a near-matching lower bound of $\Omega\left( \frac{d}{K \sqrt{T}} \right)$, where $K$ is the maximum subset size. To the best of our knowledge, this is the first theoretical result in PbRL with ranking feedback that explicitly shows improved sample efficiency as a function of the subset size.
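To make the feedback model concrete, the sketch below samples a ranking of a subset under the Plackett-Luce model and computes the likelihood of a given ranking. The function names and the scalar `scores` input are illustrative assumptions, not the paper's notation; in the paper's setting each score would be the (unknown) utility of an action in the offered subset $S_t$, and the next-ranked item is drawn from the remaining items with probability proportional to the exponential of its score.

```python
import math
import random


def plackett_luce_ranking(scores, rng=None):
    """Sample a ranking (list of indices) under the Plackett-Luce model.

    Illustrative sketch: each item i has weight exp(scores[i]); the item
    ranked next is drawn from the remaining items with probability
    proportional to its weight, without replacement.
    """
    rng = rng or random.Random(0)
    remaining = list(range(len(scores)))
    ranking = []
    while remaining:
        weights = [math.exp(scores[i]) for i in remaining]
        r = rng.random() * sum(weights)
        cum = 0.0
        for idx, w in zip(remaining, weights):
            cum += w
            if r < cum:
                ranking.append(idx)
                remaining.remove(idx)
                break
    return ranking


def ranking_probability(scores, ranking):
    """Probability of observing a full ranking under Plackett-Luce.

    Product over positions of exp(score of chosen item) divided by the
    sum of exp(scores) of the items not yet ranked.
    """
    prob = 1.0
    remaining = list(ranking)
    for i in ranking:
        denom = sum(math.exp(scores[j]) for j in remaining)
        prob *= math.exp(scores[i]) / denom
        remaining.remove(i)
    return prob
```

With a subset of size two, the PL likelihood reduces to the familiar Bradley-Terry pairwise-comparison probability, which is one way to see ranking feedback as a strict generalization of the pairwise setting studied in most prior work.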