Preference-based Reinforcement Learning beyond Pairwise Comparisons: Benefits of Multiple Options

We study online preference-based reinforcement learning (PbRL) with the goal of improving sample efficiency. While a growing body of theoretical work has emerged, motivated by PbRL's recent empirical success, particularly in aligning large language models (LLMs), most existing studies focus only on pairwise comparisons. A few recent works (Zhu et al., 2023; Mukherjee et al., 2024; Thekumparampil et al., 2024) have explored using multiple comparisons and ranking feedback, but their performance guarantees fail to improve, and can even deteriorate, as the feedback length increases, despite the richer information available. To address this gap, we adopt the Plackett-Luce (PL) model for ranking feedback over action subsets and propose M-AUPO, an algorithm that selects multiple actions by maximizing the average uncertainty within the offered subset. We prove that M-AUPO achieves a suboptimality gap of $\tilde{O}\left( \frac{d}{T} \sqrt{ \sum_{t=1}^T \frac{1}{|S_t|}} \right)$, where $T$ is the total number of rounds, $d$ is the feature dimension, and $|S_t|$ is the size of the subset at round $t$. This result shows that larger subsets directly lead to improved performance and, notably, the bound avoids the exponential dependence on the unknown parameter's norm, which was a fundamental limitation in most previous works. Moreover, we establish a near-matching lower bound of $\Omega\left( \frac{d}{K \sqrt{T}} \right)$, where $K$ is the maximum subset size. To the best of our knowledge, this is the first theoretical result in PbRL with ranking feedback that explicitly shows improved sample efficiency as a function of the subset size.
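To make the two central objects of the abstract concrete, the following is a minimal Python sketch, not the authors' implementation. It shows (i) how a ranking over an offered subset can be scored and sampled under the standard Plackett-Luce model, and (ii) how the stated upper-bound expression $\frac{d}{T} \sqrt{\sum_{t=1}^T \frac{1}{|S_t|}}$ shrinks as the subset size grows. The function names (`pl_ranking_prob`, `sample_pl_ranking`, `upper_bound_rate`) and the plain per-action utility scores are assumptions made for illustration; the abstract does not specify how utilities are parameterized, and the sketch ignores the constants and logarithmic factors hidden by $\tilde{O}$.

```python
import math
import random


def pl_ranking_prob(utilities, ranking):
    """Probability of `ranking` (indices, best first) under Plackett-Luce.

    `utilities` is a hypothetical list of real-valued scores, one per action
    in the offered subset. Each position is filled by a softmax choice among
    the actions that have not been ranked yet.
    """
    remaining = list(ranking)
    prob = 1.0
    for idx in ranking:
        denom = sum(math.exp(utilities[j]) for j in remaining)
        prob *= math.exp(utilities[idx]) / denom
        remaining.remove(idx)
    return prob


def sample_pl_ranking(utilities, rng):
    """Draw one full ranking by repeatedly sampling the next-best action."""
    remaining = list(range(len(utilities)))
    ranking = []
    while remaining:
        weights = [math.exp(utilities[j]) for j in remaining]
        pick = rng.choices(remaining, weights=weights, k=1)[0]
        ranking.append(pick)
        remaining.remove(pick)
    return ranking


def upper_bound_rate(d, subset_sizes):
    """Evaluate (d / T) * sqrt(sum_t 1 / |S_t|), ignoring log factors."""
    T = len(subset_sizes)
    return (d / T) * math.sqrt(sum(1.0 / s for s in subset_sizes))


if __name__ == "__main__":
    rng = random.Random(0)
    scores = [1.0, 0.5, -0.2, 0.0]  # illustrative utilities only
    ranking = sample_pl_ranking(scores, rng)
    print("sampled ranking:", ranking)
    print("its PL probability:", pl_ranking_prob(scores, ranking))
    for k in (2, 4, 8):
        print("K =", k, "rate =", upper_bound_rate(d=4, subset_sizes=[k] * 100))
```

When every round offers the same number of actions, $|S_t| = K$, the expression simplifies to $\frac{d}{T}\sqrt{T/K} = \frac{d}{\sqrt{KT}}$, so moving from pairwise comparisons ($K = 2$) to larger subsets lowers the rate by a factor of $\sqrt{K/2}$.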

