Reinforcement learning (RL) methods are typically applied directly to an environment to learn a policy. In complex environments with continuous state-action spaces, sparse rewards, and/or long temporal horizons, learning a good policy in the original environment can be difficult. Focusing on the offline RL setting, we aim to build a simple, discrete world model that abstracts the original environment. RL methods are then applied to this world model rather than to the environment data, which simplifies policy learning. Our world model, dubbed Value Memory Graph (VMG), is designed as a directed-graph-based Markov decision process (MDP) whose vertices and directed edges represent graph states and graph actions, respectively. Because the state-action space of VMG is finite and relatively small compared to that of the original environment, we can directly apply the value iteration algorithm on VMG to estimate graph state values and identify the best graph actions. VMG is trained from and built on the offline RL dataset. Together with an action translator that converts the abstract graph actions in VMG into real actions in the original environment, VMG controls agents to maximize episode returns. Our experiments on the D4RL benchmark show that VMG can outperform state-of-the-art offline RL methods in several goal-oriented tasks, especially when environments have sparse rewards and long temporal horizons. Code is available at https://github.com/TsuTikgiau/ValueMemoryGraph
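To make the value iteration step concrete, the following is a minimal sketch of value iteration on a small directed graph whose vertices are graph states and whose edges are graph actions. It is not taken from the VMG code base: the function name `value_iteration`, the edge-reward dictionary, the deterministic transitions, the discount factor `gamma`, and the toy graph are all assumptions made for illustration. How VMG actually builds its graph and assigns rewards is not specified in this abstract.

```python
from typing import Dict, Hashable, List, Tuple

Edge = Tuple[Hashable, Hashable]  # (source vertex, target vertex)


def value_iteration(
    edge_rewards: Dict[Edge, float],
    gamma: float = 0.9,
    tol: float = 1e-8,
    max_iters: int = 10_000,
) -> Tuple[Dict[Hashable, float], Dict[Hashable, Hashable]]:
    """Value iteration on a deterministic graph MDP (illustrative assumption).

    Taking edge (u, v) in vertex u moves to v and yields edge_rewards[(u, v)].
    Vertices without outgoing edges are treated as terminal with value 0.
    """
    # Build the adjacency list: vertex -> list of successor vertices.
    successors: Dict[Hashable, List[Hashable]] = {}
    for u, v in edge_rewards:
        successors.setdefault(u, []).append(v)
        successors.setdefault(v, [])

    values = {s: 0.0 for s in successors}
    for _ in range(max_iters):
        delta = 0.0
        for s, nexts in successors.items():
            if not nexts:  # terminal vertex, value stays 0
                continue
            # Bellman optimality backup over the outgoing edges of s.
            new_value = max(edge_rewards[(s, n)] + gamma * values[n] for n in nexts)
            delta = max(delta, abs(new_value - values[s]))
            values[s] = new_value
        if delta < tol:
            break

    # Greedy graph action: the successor vertex with the highest backed-up value.
    policy = {
        s: max(nexts, key=lambda n: edge_rewards[(s, n)] + gamma * values[n])
        for s, nexts in successors.items()
        if nexts
    }
    return values, policy


# Hypothetical toy graph with a sparse reward: only the edge into "goal" pays off.
toy_edges = {
    ("a", "b"): 0.0,
    ("b", "goal"): 1.0,
    ("a", "c"): 0.0,
    ("c", "a"): 0.0,
}
values, policy = value_iteration(toy_edges, gamma=0.9)
print(values)
print(policy)
```

In this toy graph the greedy policy routes every vertex toward `goal`, even from `c`, which only reaches it through `a`. In the setting described above, an action translator would then be needed to convert each chosen graph edge into a real action in the original environment; that component is outside the scope of this sketch.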