While many real-world problems might benefit from reinforcement learning, they rarely fit into the MDP mold: interacting with the environment is often expensive, and specifying reward functions is challenging. Motivated by these challenges, prior work has developed data-driven approaches that learn entirely from samples of the transition dynamics and examples of high-return states. These methods typically learn a reward function from high-return states, use that reward function to label the transitions, and then apply an offline RL algorithm to the labeled transitions. While these methods can achieve good results on many tasks, they can be complex, often requiring regularization and temporal difference updates. In this paper, we propose a method for offline, example-based control that learns an implicit model of multi-step transitions, rather than a reward function. We show that this implicit model can represent the Q-values for the example-based control problem. Across a range of state-based and image-based offline control tasks, our method outperforms baselines that use learned reward functions; additional experiments demonstrate improved robustness and scaling with dataset size.
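To make the prior-work pipeline described above concrete, the following is a minimal sketch of its first two steps: fitting a reward function from examples of high-return states and using it to label transitions. It is an illustration under stated assumptions, not the implementation of any particular baseline: the logistic-regression classifier, the toy 1-D data, and the names `fit_reward_classifier` and `label_transitions` are all hypothetical. The third step, running an offline RL algorithm on the labeled transitions, is omitted.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def fit_reward_classifier(success_states, dataset_states, steps=500, lr=0.5):
    """Hypothetical reward model: logistic regression that separates
    success examples (label 1) from ordinary dataset states (label 0)."""
    dim = len(success_states[0])
    w, b = [0.0] * dim, 0.0
    data = [(s, 1.0) for s in success_states] + [(s, 0.0) for s in dataset_states]
    for _ in range(steps):
        grad_w, grad_b = [0.0] * dim, 0.0
        for s, y in data:
            err = sigmoid(sum(wi * si for wi, si in zip(w, s)) + b) - y
            for i in range(dim):
                grad_w[i] += err * s[i]
            grad_b += err
        # Full-batch gradient step on the mean cross-entropy loss.
        w = [wi - lr * g / len(data) for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / len(data)
    return lambda s: sigmoid(sum(wi * si for wi, si in zip(w, s)) + b)


def label_transitions(transitions, reward_fn):
    """Turn (state, action, next_state) into (state, action, reward, next_state),
    scoring the next state with the learned reward function."""
    return [(s, a, reward_fn(s_next), s_next) for s, a, s_next in transitions]


random.seed(0)
# Toy 1-D data (an assumption for illustration): success examples cluster
# near 1.0, while states from the offline dataset cluster near 0.0.
success = [[1.0 + random.gauss(0, 0.1)] for _ in range(50)]
dataset = [[random.gauss(0, 0.1)] for _ in range(50)]

reward_fn = fit_reward_classifier(success, dataset)
transitions = [([0.0], 0, [0.1]), ([0.8], 1, [1.0])]
labeled = label_transitions(transitions, reward_fn)
```

In this sketch the learned reward is just the classifier's probability that a state is a success example, so the labeled transitions could be passed to any off-the-shelf offline RL method. The approach proposed in the abstract avoids this intermediate reward model altogether.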