Decoupling Meta-Reinforcement Learning with Gaussian Task Contexts and Skills

Offline meta-reinforcement learning (meta-RL) methods, which use prior experience to adapt to unseen target tasks, are essential in robot control. Current methods typically use task contexts and skills as prior experience, where task contexts capture the information within each task and skills represent sets of temporally extended actions for solving subtasks. However, these methods still perform poorly when adapting to unseen target tasks, mainly because the learned prior experience generalizes badly: exploring and learning continuous latent spaces does not let them extract effective prior experience from the meta-training tasks. We propose a framework called decoupled meta-reinforcement learning (DCMRL), which (1) contrastively restricts the learning of task contexts by pulling together similar task contexts within the same task and pushing apart the task contexts of different tasks, and (2) uses a Gaussian quantization variational autoencoder (GQ-VAE) to cluster the Gaussian distributions of the task contexts and of the skills separately, decoupling the exploration and learning processes of their spaces. The resulting cluster centers, which serve as representative discrete distributions of task contexts and skills, are stored in a task context codebook and a skill codebook, respectively. DCMRL can thus acquire generalizable prior experience and adapt effectively to unseen target tasks during the meta-testing phase. Experiments on navigation and robot manipulation continuous control tasks show that DCMRL is more effective than previous meta-RL methods and learns more generalizable prior experience.
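As a rough illustration of the two ingredients named above, the following Python sketch shows (a) a contrastive loss that is small when a task context lies near a context from the same task and far from contexts of other tasks, and (b) a codebook lookup that maps a diagonal Gaussian to its nearest stored Gaussian. It is a minimal sketch under stated assumptions, not the DCMRL implementation: the abstract does not specify the loss form or the distance measure, so the InfoNCE-style loss, the use of KL divergence, and all names (kl_diag_gauss, quantize, contrastive_loss, temperature) are hypothetical choices made for illustration.

```python
import math


def kl_diag_gauss(mu_p, var_p, mu_q, var_q):
    """KL(p || q) between two diagonal Gaussians given means and variances."""
    return 0.5 * sum(
        math.log(vq / vp) + (vp + (mp - mq) ** 2) / vq - 1.0
        for mp, vp, mq, vq in zip(mu_p, var_p, mu_q, var_q)
    )


def quantize(mu, var, codebook):
    """Return the index of the codebook Gaussian closest to N(mu, var).

    Assumption: closeness is measured by KL divergence; the source does
    not state which distance GQ-VAE uses.
    """
    return min(
        range(len(codebook)),
        key=lambda k: kl_diag_gauss(mu, var, codebook[k][0], codebook[k][1]),
    )


def contrastive_loss(anchor, positive, negatives, temperature=0.1):
    """InfoNCE-style loss on context vectors (an assumed loss form).

    Similarity is the negative squared Euclidean distance scaled by the
    temperature. The loss is low when the anchor is close to the positive
    (same task) and far from the negatives (other tasks).
    """
    def sim(a, b):
        return -sum((x - y) ** 2 for x, y in zip(a, b)) / temperature

    logits = [sim(anchor, positive)] + [sim(anchor, n) for n in negatives]
    # Numerically stable log-sum-exp over all candidates.
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum - logits[0]


# Hypothetical two-entry codebook of (mean, variance) pairs.
codebook = [([0.0, 0.0], [1.0, 1.0]), ([5.0, 5.0], [1.0, 1.0])]
nearest = quantize([4.5, 5.2], [0.8, 1.1], codebook)

# Same-task pair close together vs. same-task pair far apart.
loss_aligned = contrastive_loss([0.0, 0.0], [0.1, 0.0], [[3.0, 3.0]])
loss_misaligned = contrastive_loss([0.0, 0.0], [3.0, 3.0], [[0.1, 0.0]])
```

In this toy setup the encoder output is assigned to the second codebook entry, and the contrastive loss is lower when the same-task context is the nearby one. A full system would learn the codebook entries and the encoder jointly, which this sketch does not attempt.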
