Model-based reinforcement learning (RL) has shown great promise due to its sample efficiency, but it still struggles with long-horizon, sparse-reward tasks, especially in offline settings where the agent learns from a fixed dataset. We hypothesize that model-based RL agents struggle in these environments because they lack long-term planning capabilities, and that planning in a temporally abstract model of the environment can alleviate this issue. In this paper, we make two key contributions: 1) we introduce an offline model-based RL algorithm, IQL-TD-MPC, that extends the state-of-the-art Temporal Difference Learning for Model Predictive Control (TD-MPC) with Implicit Q-Learning (IQL); 2) we propose to use IQL-TD-MPC as a Manager in a hierarchical setting, with any off-the-shelf offline RL algorithm as a Worker. More specifically, we pre-train a temporally abstract IQL-TD-MPC Manager to predict "intent embeddings", which roughly correspond to subgoals, via planning. We empirically show that augmenting state representations with intent embeddings generated by an IQL-TD-MPC Manager significantly improves the performance of off-the-shelf offline RL agents on some of the most challenging D4RL benchmark tasks. For instance, the offline RL algorithms AWAC, TD3-BC, DT, and CQL all achieve zero or near-zero normalized evaluation scores on the medium and large antmaze tasks, while our modification yields an average score of over 40.
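To make the two ingredients above concrete, here is a minimal sketch, not the paper's implementation. It shows (a) the asymmetric expectile loss that IQL uses for value learning and (b) how a Worker's state could be augmented by concatenating an intent embedding from a Manager. The class `PlaceholderManager`, its random projection, the function names, and all dimensions are assumptions introduced purely for illustration; the real Manager is a pre-trained IQL-TD-MPC model that produces intents via planning.

```python
import numpy as np


def expectile_loss(diff, tau=0.7):
    """Asymmetric squared loss from IQL: |tau - 1(diff < 0)| * diff**2.

    `diff` is typically Q(s, a) - V(s); tau > 0.5 penalizes positive
    differences more heavily than negative ones.
    """
    weight = np.where(diff < 0, 1.0 - tau, tau)
    return float(np.mean(weight * diff ** 2))


class PlaceholderManager:
    """Hypothetical stand-in for a pre-trained Manager (assumption).

    A fixed random projection replaces the planning step so that the
    example stays self-contained and runnable.
    """

    def __init__(self, state_dim, intent_dim, seed=0):
        rng = np.random.default_rng(seed)
        self.projection = rng.normal(size=(state_dim, intent_dim))

    def predict_intent(self, states):
        # Map each state to a bounded embedding of size intent_dim.
        return np.tanh(states @ self.projection)


def augment_states(states, manager):
    """Concatenate each state with its intent embedding for the Worker."""
    intents = manager.predict_intent(states)
    return np.concatenate([states, intents], axis=-1)


if __name__ == "__main__":
    states = np.zeros((4, 10))  # batch of 4 states; sizes are arbitrary
    manager = PlaceholderManager(state_dim=10, intent_dim=3)
    print(augment_states(states, manager).shape)
```

Under these assumptions, any off-the-shelf offline RL algorithm could be trained on the augmented states without changing its own update rule, which is the sense in which the Worker is interchangeable.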