Training offline reinforcement learning (RL) models on visual inputs poses two significant challenges: overfitting in representation learning and overestimation bias in expected future rewards. Recent work has attempted to alleviate the overestimation bias by encouraging conservative behaviors. This paper, in contrast, aims to build more flexible constraints for value estimation without impeding the exploration of potential advantages. The key idea is to leverage off-the-shelf RL simulators, which are easy to interact with online, as a "test bed" for offline policies. To enable effective online-to-offline knowledge transfer, we introduce CoWorld, a model-based RL approach that mitigates cross-domain discrepancies in state and reward spaces. Experimental results show that CoWorld outperforms existing RL approaches by large margins.