This study addresses offline preference-based reinforcement learning (PbRL), a variant of conventional reinforcement learning that requires neither online interaction nor a specified reward function. Instead, the agent is given pre-existing offline trajectories, from which it extracts the dynamics, and human preferences between pairs of trajectories, from which it extracts the task information. Since the dynamics and task information are orthogonal, a naive approach is to apply preference-based reward learning followed by an off-the-shelf offline RL algorithm. However, this requires separately learning a scalar reward function, which is assumed to act as an information bottleneck. To address this issue, we propose the offline preference-guided policy optimization (OPPO) paradigm, which models offline trajectories and preferences in a single step, without separately learning a reward function. OPPO achieves this by introducing an offline hindsight information matching objective for optimizing a contextual policy and a preference modeling objective for finding the optimal context. By optimizing the two objectives iteratively, OPPO further obtains a well-performing decision policy. Our empirical results demonstrate that OPPO effectively models offline preferences and outperforms prior competing baselines, including offline RL algorithms run with either true or pseudo reward functions. Our code is available at https://github.com/bkkgbkjb/OPPO.
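To make the contrast between the two approaches concrete, the sketch below shows two toy losses in plain Python. The first is the standard Bradley-Terry cross-entropy commonly used for preference-based reward learning, which is the first stage of the naive pipeline. The second is only an illustrative guess at what a preference objective over contexts could look like: the abstract does not specify the form of OPPO's preference modeling objective, so the triplet-style margin loss, the function names, and the `margin` parameter are all assumptions rather than the authors' implementation.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def bradley_terry_loss(return_a, return_b, a_preferred):
    """Cross-entropy loss under a Bradley-Terry preference model.

    return_a, return_b: summed predicted rewards of two trajectories.
    a_preferred: True if the human preferred trajectory A over B.
    """
    # Model: P(A preferred) = sigmoid(return_a - return_b).
    logit = return_a - return_b
    if not a_preferred:
        logit = -logit
    # -log(sigmoid(logit)) equals softplus(-logit).
    return softplus(-logit)


def context_preference_loss(z_star, z_pos, z_neg, margin=1.0):
    """Hypothetical triplet-style loss over context embeddings (an assumption).

    z_star: candidate "optimal" context vector being searched for.
    z_pos:  context embedding of the preferred trajectory.
    z_neg:  context embedding of the less preferred trajectory.
    The loss is zero once z_star is closer to z_pos than to z_neg
    by at least `margin` in Euclidean distance.
    """
    d_pos = math.dist(z_star, z_pos)
    d_neg = math.dist(z_star, z_neg)
    return max(d_pos - d_neg + margin, 0.0)


if __name__ == "__main__":
    # The reward-model loss shrinks when the predicted returns agree with the label.
    print(bradley_terry_loss(5.0, 0.0, a_preferred=True))
    print(bradley_terry_loss(0.0, 5.0, a_preferred=True))
    # The context loss is positive when z_star sits nearer the dispreferred context.
    print(context_preference_loss([0.0, 0.0], [3.0, 4.0], [0.0, 0.0]))
```

The first loss needs an explicit scalar reward (here, the summed returns), whereas the second operates directly on context vectors and involves no reward function, which is the distinction the abstract draws.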