Human feedback often arrives as preferences rather than calibrated numeric rewards, motivating reinforcement learning from preferential feedback, also referred to as reinforcement learning from human feedback (RLHF). We present a rigorous theoretical study of preference-only learning in episodic kernel MDPs. In each episode, the learner deploys two policies from a common start state and receives a single binary label indicating which trajectory is preferred, modeled by a Bradley--Terry--Luce link on the difference of the cumulative (unobserved) rewards. Under kernel-based assumptions on the reward and transition functions (one of the most general models amenable to theoretical analysis), we develop preference-based value estimation and confidence sets tailored to end-of-episode comparisons. We prove high-probability regret bounds that scale sublinearly in the number of episodes, implying that the value of the learned policy converges to that of the optimal policy.
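As a sketch of the feedback model described above, the Bradley--Terry--Luce link can be written as follows. The notation is assumed here for illustration and is not taken from the paper: $H$ is the episode length, $r$ the unobserved reward function, $\tau^{i} = (s^{i}_{1}, a^{i}_{1}, \dots, s^{i}_{H}, a^{i}_{H})$ the trajectory of the $i$-th deployed policy, and $\sigma$ the logistic function.

\[
\Pr\left(\tau^{1} \succ \tau^{2}\right)
  = \sigma\!\left( \sum_{h=1}^{H} r\left(s^{1}_{h}, a^{1}_{h}\right) - \sum_{h=1}^{H} r\left(s^{2}_{h}, a^{2}_{h}\right) \right),
\qquad
\sigma(x) = \frac{1}{1 + e^{-x}} .
\]

Under one common definition of regret for pairwise comparisons (again an assumption, with $V^{\pi}$ the value of policy $\pi$ at the common start state $s_{1}$ and $\pi^{1}_{t}, \pi^{2}_{t}$ the two policies deployed in episode $t$), a sublinear bound means the average per-episode gap vanishes:

\[
R(T) = \sum_{t=1}^{T} \left( V^{\star}(s_{1}) - \tfrac{1}{2}\left[ V^{\pi^{1}_{t}}(s_{1}) + V^{\pi^{2}_{t}}(s_{1}) \right] \right),
\qquad
R(T) = o(T) \;\Longrightarrow\; \frac{R(T)}{T} \to 0 .
\]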