Meta-reinforcement learning (meta-RL) enables artificial agents to learn from related training tasks and to adapt to new tasks efficiently with minimal interaction data. However, most existing research is still limited to narrow task distributions that are parametric and stationary, and does not consider out-of-distribution tasks during evaluation, which restricts its application. In this paper, we propose MoSS, a context-based Meta-reinforcement learning algorithm based on Self-Supervised task representation learning, to address this challenge. We extend meta-RL to broad non-parametric task distributions, which have never been explored before, and also achieve state-of-the-art results on non-stationary and out-of-distribution tasks. Specifically, MoSS consists of a task inference module and a policy module. We use a Gaussian mixture model for task representation to capture both parametric and non-parametric task variations. Additionally, our online adaptation strategy enables the agent to react as soon as a task change occurs, which makes it applicable to non-stationary tasks. MoSS also exhibits strong generalization robustness on out-of-distribution tasks, which benefits from the reliable and robust task representation. The policy is built on top of an off-policy RL algorithm, and the entire network is trained fully off-policy to ensure high sample efficiency. On the MuJoCo and Meta-World benchmarks, MoSS outperforms prior work in asymptotic performance, sample efficiency (3-50x faster), adaptation efficiency, and generalization robustness on broad and diverse task distributions.
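As a rough illustration of the Gaussian mixture task representation mentioned in the abstract, the sketch below draws a latent task vector from a mixture of K diagonal Gaussians. The abstract does not describe the architecture, so everything here is an assumption made for illustration: the function name `sample_task_latent`, the array shapes, and in particular the reading that the choice of component reflects non-parametric (discrete) task variation while the spread within a component reflects parametric (continuous) variation. In a trained model the mixture parameters would presumably come from the task inference module; here they are fixed by hand.

```python
import numpy as np

def sample_task_latent(weights, means, log_stds, rng):
    """Draw one latent task vector z from a mixture of K diagonal Gaussians.

    weights:  (K,)   mixture weights, non-negative and summing to 1
    means:    (K, D) component means
    log_stds: (K, D) component log standard deviations
    Returns (z, k): the latent of shape (D,) and the sampled component index.
    """
    weights = np.asarray(weights, dtype=float)
    means = np.asarray(means, dtype=float)
    log_stds = np.asarray(log_stds, dtype=float)
    # Assumption: the component index stands in for a discrete task family.
    k = int(rng.choice(len(weights), p=weights))
    # Assumption: variation inside the component stands in for continuous
    # task parameters (sampled as mean + std * standard normal noise).
    eps = rng.standard_normal(means.shape[1])
    z = means[k] + np.exp(log_stds[k]) * eps
    return z, k

# Hand-picked, hypothetical mixture with two well-separated components.
rng = np.random.default_rng(0)
weights = np.array([0.5, 0.5])
means = np.array([[-5.0, 0.0], [5.0, 0.0]])
log_stds = np.full((2, 2), -2.0)
z, k = sample_task_latent(weights, means, log_stds, rng)
```

With components this far apart relative to their standard deviation, the sampled `z` stays close to the mean of the selected component, which is the property a downstream policy conditioned on `z` would rely on to tell task families apart.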