Offline preference-based reinforcement learning (PbRL) provides an effective way to overcome the challenges of reward design and the high cost of online interaction. However, since preference labeling requires real-time human feedback, acquiring sufficient preference labels is challenging. To address this, this paper proposes an offLine prEference-bAsed RL with high Sample Efficiency (LEASE) algorithm, where a learned transition model is leveraged to generate unlabeled preference data. Considering that the pretrained reward model may generate incorrect labels for this unlabeled data, we design an uncertainty-aware mechanism to ensure the performance of the reward model, in which only high-confidence, low-variance data are selected. Moreover, we derive a generalization bound for the reward model to analyze the factors influencing reward accuracy, and demonstrate that the policy learned by LEASE has a theoretical improvement guarantee. The developed theory is based on state-action pairs and can thus be easily combined with other offline algorithms. Experimental results show that LEASE achieves performance comparable to the baselines with fewer preference data and without online interaction.
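To make the selection criterion concrete, the following is a minimal sketch of one way such uncertainty-aware filtering could be implemented, assuming an ensemble of reward models and Bradley-Terry preference probabilities over predicted segment returns; the function name, shapes, and threshold values are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def select_pseudo_labels(returns_a, returns_b, conf_thresh=0.9, var_thresh=0.01):
    """Uncertainty-aware filtering of model-generated preference pairs.

    returns_a, returns_b: (n_ensemble, n_pairs) arrays of segment returns
    predicted by an ensemble of reward models for the two segments in each
    unlabeled pair. Thresholds are illustrative, not the paper's values.
    """
    # Bradley-Terry preference probability per ensemble member.
    p = 1.0 / (1.0 + np.exp(-(returns_a - returns_b)))  # (n_ensemble, n_pairs)

    mean_p = p.mean(axis=0)
    confidence = np.maximum(mean_p, 1.0 - mean_p)  # how decisive the label is
    variance = p.var(axis=0)                       # ensemble disagreement

    # Keep only pairs that are both high-confidence and low-variance.
    keep = (confidence > conf_thresh) & (variance < var_thresh)
    labels = (mean_p > 0.5).astype(np.float32)     # pseudo preference labels
    return keep, labels
```

In this sketch, confidence measures how decisive the pseudo-label is, while ensemble variance measures the reward models' disagreement; a pair must pass both tests before its pseudo-label is used for reward learning.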