Preference-based reinforcement learning (PBRL) in the offline setting has achieved great success in industrial applications such as chatbots. A two-step learning framework, in which a reward modeling step is followed by a reinforcement learning step, has been widely adopted for this problem. However, such a method faces two challenges: the risk of reward hacking and the complexity of reinforcement learning. Our key insight is that both challenges stem from state-actions not supported in the dataset: such state-actions are unreliable and increase the complexity of the reinforcement learning problem in the second step. Based on this insight, we develop a novel two-step learning method called PRC: preference-based reinforcement learning with constrained actions. The high-level idea is to restrict the reinforcement learning agent to optimizing over a constrained action space that excludes out-of-distribution state-actions. We empirically verify that our method achieves high learning efficiency on various datasets in robotic control environments.
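The constrained-action idea can be illustrated with a minimal sketch. This is not the paper's actual algorithm; the `in_support` predicate (a stand-in for any dataset-support check, e.g. one derived from a learned behavior model) and the `q_value` function are hypothetical names introduced here for illustration only.

```python
def constrained_greedy_action(state, candidate_actions, q_value, in_support):
    """Select the highest-value action among candidates that pass a
    dataset-support check, so out-of-distribution actions are never chosen.

    q_value(state, action) -> float   : learned value estimate (assumed given)
    in_support(state, action) -> bool : hypothetical support check derived
                                        from the offline dataset
    """
    # Constrain the action space: drop candidates judged out-of-distribution.
    supported = [a for a in candidate_actions if in_support(state, a)]
    if not supported:
        # No in-distribution candidate; signal that the caller must fall back
        # (e.g., to the behavior policy's action).
        return None
    # Optimize only over the constrained (in-support) action set.
    return max(supported, key=lambda a: q_value(state, a))


# Toy usage: the unconstrained optimum (action 2.0) is out of support,
# so the agent settles for the best in-support action instead.
state = 0.0
candidates = [0.0, 0.5, 1.0, 2.0]
q_value = lambda s, a: a                 # toy value: larger action looks better
in_support = lambda s, a: abs(a) <= 1.0  # toy support region from the dataset
chosen = constrained_greedy_action(state, candidates, q_value, in_support)
```

Here the reward model would rate `2.0` highest, but the support constraint excludes it, which is exactly the mechanism that guards against exploiting unreliable reward estimates on out-of-distribution state-actions.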