Reinforcement Learning with Human Feedback (RLHF) is a paradigm in which an RL agent learns to optimize a task using pairwise preference feedback over trajectories rather than explicit reward signals. While RLHF has demonstrated practical success in fine-tuning language models, existing empirical work does not address how to efficiently sample trajectory pairs for querying human feedback. In this study, we propose an efficient sampling approach for acquiring exploratory trajectories that enable accurate learning of hidden reward functions before any human feedback is collected. Our theoretical analysis shows that, under preference-based models with linear parameterization and unknown transitions, our algorithm requires less human feedback to learn the optimal policy than existing approaches. In particular, the framework accommodates linear and low-rank MDPs. Additionally, we investigate RLHF with action-based comparison feedback and introduce an efficient querying algorithm tailored to this scenario.
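As an illustration of what a preference-based model with linear parameterization can look like, the following minimal sketch computes the probability that one trajectory is preferred over another. It assumes a Bradley-Terry style logistic link and a reward that is linear in a trajectory feature vector; the abstract does not specify the link function, and the names `trajectory_reward`, `preference_probability`, `theta`, and the feature vectors are hypothetical, chosen only for this example.

```python
import math


def trajectory_reward(theta, features):
    """Linear reward: inner product of the parameter vector and trajectory features."""
    return sum(t * f for t, f in zip(theta, features))


def preference_probability(theta, features_a, features_b):
    """Probability that trajectory a is preferred over trajectory b.

    Assumption: a logistic (Bradley-Terry) link applied to the reward difference.
    """
    diff = trajectory_reward(theta, features_a) - trajectory_reward(theta, features_b)
    return 1.0 / (1.0 + math.exp(-diff))


# Hypothetical hidden reward parameter and two trajectory feature vectors.
theta = [1.0, -0.5]
traj_a = [2.0, 1.0]  # reward 1.5
traj_b = [1.0, 1.0]  # reward 0.5

p_a_over_b = preference_probability(theta, traj_a, traj_b)
print(f"P(a preferred over b) = {p_a_over_b:.4f}")
```

Under this assumed model, a human's answer to a pairwise query is a noisy observation of the reward difference, which is why the choice of trajectory pairs to query affects how quickly the hidden parameter can be estimated.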