Offline Reinforcement Learning (ORL) offers a robust solution for training agents in applications where interaction with the environment must be strictly limited due to cost, safety, or the lack of an accurate simulation environment. Despite its potential to facilitate the deployment of artificial agents in the real world, ORL typically requires a large number of demonstrations annotated with ground-truth rewards. Consequently, state-of-the-art ORL algorithms can be difficult or impossible to apply in data-scarce scenarios. In this paper we propose a simple yet effective Reward Model that estimates the reward signal from a very limited sample of environment transitions annotated with rewards. Once the reward signal is modeled, we use the Reward Model to impute rewards for a large sample of reward-free transitions, thus enabling the application of ORL techniques. We demonstrate the potential of our approach on several D4RL continuous locomotion tasks. Our results show that, using only 1\% of reward-labeled transitions from the original datasets, our learned reward model is able to impute rewards for the remaining 99\% of the transitions, from which performant agents can be trained with ORL.
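To make the pipeline concrete, the sketch below shows one way it could be instantiated: a small MLP reward model is fit by supervised regression on the reward-labeled subset and then used to impute rewards for the remaining reward-free transitions. The architecture, the choice of state-action inputs, and all hyperparameters here are illustrative assumptions, not the paper's exact design.

\begin{verbatim}
# Minimal sketch of the reward-modeling pipeline, assuming a
# supervised MLP regressor over (state, action) pairs.
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """MLP mapping a (state, action) pair to a scalar reward estimate."""
    def __init__(self, state_dim: int, action_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1)).squeeze(-1)

def fit_reward_model(model, states, actions, rewards,
                     epochs=100, lr=1e-3):
    """Supervised regression on the small reward-labeled subset."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        loss = nn.functional.mse_loss(model(states, actions), rewards)
        loss.backward()
        opt.step()
    return model

@torch.no_grad()
def impute_rewards(model, states, actions):
    """Predict rewards for the reward-free transitions."""
    return model(states, actions)

# Toy usage with random data standing in for D4RL transitions.
state_dim, action_dim, n = 17, 6, 10_000
states, actions = torch.randn(n, state_dim), torch.randn(n, action_dim)
rewards = torch.randn(n)

labeled = torch.rand(n) < 0.01   # ~1% reward-labeled split
model = RewardModel(state_dim, action_dim)
fit_reward_model(model, states[labeled], actions[labeled],
                 rewards[labeled])
imputed = impute_rewards(model, states[~labeled], actions[~labeled])
# `imputed` then serves as the reward signal for any standard ORL method.
\end{verbatim}

Under these assumptions, the reward model plays the role of a plug-in labeler: any off-the-shelf ORL algorithm can subsequently be run on the fully labeled dataset without modification.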