Preference-based Reinforcement Learning (PbRL) is a paradigm in which an RL agent learns to optimize a task from pairwise preference feedback over trajectories rather than from explicit reward signals. While PbRL has demonstrated practical success in fine-tuning language models, existing theoretical work focuses on regret minimization and fails to capture most practical frameworks. In this study, we bridge this gap between theoretical PbRL and practical algorithms by proposing a theoretical reward-agnostic PbRL framework in which exploratory trajectories that enable accurate learning of the hidden reward function are acquired before any human feedback is collected. Our theoretical analysis shows that, for preference models with linear parameterization and unknown transitions, our algorithm requires less human feedback to learn the optimal policy than existing theoretical work. Specifically, our framework handles linear and low-rank MDPs with efficient sample complexity. Additionally, we investigate reward-agnostic RL with action-based comparison feedback and introduce an efficient querying algorithm tailored to this scenario.
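To make "preference-based models with linear parameterization" concrete, the following is a minimal sketch and not the paper's algorithm. It assumes a logistic (Bradley-Terry) link, which the abstract does not specify, and a trajectory embedding formed by summing per-step features. The names `trajectory_features`, `preference_prob`, and `theta`, as well as all numeric values, are hypothetical and chosen only for illustration.

```python
import math


def trajectory_features(step_features):
    """Sum per-step feature vectors into one trajectory embedding (assumed form)."""
    return [sum(column) for column in zip(*step_features)]


def preference_prob(theta, phi_a, phi_b):
    """Probability that trajectory a is preferred over trajectory b.

    Assumes a logistic link applied to the linear reward gap
    theta . (phi_a - phi_b); the link choice is an assumption.
    """
    margin = sum(t * (a - b) for t, a, b in zip(theta, phi_a, phi_b))
    return 1.0 / (1.0 + math.exp(-margin))


# Hypothetical hidden reward parameter and two short trajectories.
theta = [1.0, -0.5]
traj_a = trajectory_features([[1.0, 0.0], [0.5, 0.2]])
traj_b = trajectory_features([[0.0, 1.0], [0.1, 0.3]])

p_ab = preference_prob(theta, traj_a, traj_b)
p_ba = preference_prob(theta, traj_b, traj_a)
```

Under this assumed model, the feedback depends on the trajectories only through the difference of their embeddings, which is why collecting exploratory trajectories whose feature differences are diverse can help identify the hidden reward parameter before any human is queried.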