Offline preference-based reinforcement learning (PbRL) typically operates in two phases: first, use human preferences to learn a reward model and annotate rewards for a reward-free offline dataset; second, learn a policy by optimizing the learned reward via offline RL. However, accurately modeling step-wise rewards from trajectory-level preference feedback presents inherent challenges. The resulting reward bias, particularly the overestimation of predicted rewards, leads to optimistic trajectory stitching, which undermines the pessimism mechanism critical to the offline RL phase. To address this challenge, we propose In-Dataset Trajectory Return Regularization (DTR) for offline PbRL, which leverages conditional sequence modeling to mitigate the risk of learning inaccurate trajectory stitching under reward bias. Specifically, DTR combines Decision Transformer and TD-learning to strike a balance between maintaining fidelity to the behavior policy on trajectories with high in-dataset returns and selecting optimal actions based on high reward labels. Additionally, we introduce an ensemble normalization technique that effectively integrates multiple reward models, balancing the trade-off between reward differentiation and accuracy. Empirical evaluations on various benchmarks demonstrate the superiority of DTR over other state-of-the-art baselines.
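To make the ensemble normalization idea concrete, here is a minimal sketch of one plausible instantiation: each reward model's predictions are normalized over the dataset before averaging, so models operating on different scales contribute comparably. The function name, the min-max normalization choice, and the simple mean aggregation are illustrative assumptions, not the paper's exact scheme.

```python
import numpy as np

def ensemble_normalized_rewards(reward_models, states, actions):
    """Hypothetical sketch of ensemble normalization: rescale each
    reward model's predictions over the whole dataset to [0, 1]
    (min-max normalization is an assumption here), then average
    across models so no single scale dominates."""
    per_model = []
    for rm in reward_models:
        r = np.asarray(rm(states, actions), dtype=np.float64)  # shape (N,)
        # Normalize this model's rewards to a common scale.
        r = (r - r.min()) / (r.max() - r.min() + 1e-8)
        per_model.append(r)
    # Aggregate by averaging the normalized predictions.
    return np.mean(per_model, axis=0)

# Usage with two toy reward models on very different scales:
states = np.arange(5.0)
actions = np.zeros(5)
models = [lambda s, a: 2.0 * s,          # rewards in [0, 8]
          lambda s, a: 10.0 * s + 5.0]   # rewards in [5, 45]
rewards = ensemble_normalized_rewards(models, states, actions)
```

Because both toy models are increasing in the state, their normalized predictions agree after rescaling; the interesting cases arise when ensemble members disagree, where averaging damps individual overestimation.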