Offline preference-based reinforcement learning (RL), which optimizes policies using human preferences between pairs of trajectory segments drawn from an offline dataset, has emerged as a practical avenue for RL applications. Existing works rely on extracting step-wise reward signals from trajectory-wise preference annotations, assuming that preferences correlate with cumulative Markovian rewards. However, such methods fail to capture the holistic perspective of data annotation: humans often assess the desirability of a sequence of actions by considering the overall outcome rather than the immediate rewards. To address this challenge, we propose to model human preferences using rewards conditioned on the future outcomes of the trajectory segments, i.e., the hindsight information. For downstream RL optimization, the reward of each step is computed by marginalizing over possible future outcomes, whose distribution is approximated by a variational auto-encoder trained on the offline dataset. Our proposed method, Hindsight Preference Learning (HPL), facilitates credit assignment by taking full advantage of the vast trajectory data available in unlabeled datasets. Comprehensive empirical studies demonstrate the benefits of HPL in delivering robust and advantageous rewards across various domains. Our code is publicly released at https://github.com/typoverflow/WiseRL.
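The marginalization step described above can be sketched as follows. The notation here is illustrative rather than taken from the paper: $z$ denotes a latent encoding of the future outcome, $r_\psi$ a learned hindsight-conditioned reward, and $p_\theta$ the outcome distribution approximated by the VAE:

```latex
% Sketch of the hindsight-reward marginalization (notation is illustrative).
% r_psi(s_t, a_t, z): reward conditioned on a latent future outcome z
% p_theta(z | s_t, a_t): distribution over future outcomes, approximated by
%                        the VAE trained on the offline dataset
\hat{r}(s_t, a_t) \;=\; \mathbb{E}_{z \sim p_\theta(z \mid s_t, a_t)}
    \big[\, r_\psi(s_t, a_t, z) \,\big]
```

The resulting step-wise reward $\hat{r}$ no longer depends on any particular realized future, so it can be plugged into standard offline RL algorithms.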