Offline preference-based reinforcement learning (PbRL) provides an effective way to overcome the challenges of reward design and the high cost of online interaction. However, since preference labeling requires real-time human feedback, acquiring sufficient preference labels is challenging. To address this, this paper proposes an offLine prEference-bAsed RL with high Sample Efficiency (LEASE) algorithm, where a learned transition model is leveraged to generate unlabeled preference data. Considering that the pretrained reward model may generate incorrect labels for this unlabeled data, we design an uncertainty-aware mechanism to ensure the performance of the reward model, in which only high-confidence, low-variance data are selected. Moreover, we derive a generalization bound for the reward model to analyze the factors influencing reward accuracy, and demonstrate that the policy learned by LEASE has a theoretical improvement guarantee. The developed theory is based on state-action pairs and can thus be easily combined with other offline algorithms. Experimental results show that LEASE achieves performance comparable to the baselines with fewer preference data and without online interaction.
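To make the selection criterion concrete, the following is a minimal sketch of one way such uncertainty-aware filtering could be implemented, assuming an ensemble of reward models and Bradley-Terry preference probabilities over predicted segment returns; the function name, shapes, and threshold values are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def select_pseudo_labels(returns_a, returns_b, conf_thresh=0.9, var_thresh=0.01):
    """Uncertainty-aware filtering of model-generated preference pairs.

    returns_a, returns_b: (n_ensemble, n_pairs) arrays of segment returns
    predicted by an ensemble of reward models for the two segments in each
    unlabeled pair. Thresholds are illustrative, not the paper's values.
    """
    # Bradley-Terry preference probability per ensemble member.
    p = 1.0 / (1.0 + np.exp(-(returns_a - returns_b)))  # (n_ensemble, n_pairs)

    mean_p = p.mean(axis=0)
    confidence = np.maximum(mean_p, 1.0 - mean_p)  # how decisive the label is
    variance = p.var(axis=0)                       # ensemble disagreement

    # Keep only pairs that are both high-confidence and low-variance.
    keep = (confidence > conf_thresh) & (variance < var_thresh)
    labels = (mean_p > 0.5).astype(np.float32)     # pseudo preference labels
    return keep, labels
```

In this sketch, confidence measures how decisive the pseudo-label is, while ensemble variance measures the reward models' disagreement; a pair must pass both tests before its pseudo-label is used for reward learning.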