Meta-Reinforcement Learning (Meta-RL) agents can struggle to operate across tasks with varying environmental features that require different optimal skills (i.e., different modes of behaviour). Context encoders based on contrastive learning are now widely studied as a way to enhance the generalisability of Meta-RL agents, but this approach faces challenges such as the requirement for a large sample size, also referred to as the $\log$-$K$ curse. To improve RL generalisation to different tasks, we first introduce Skill-aware Mutual Information (SaMI), an optimisation objective that aids in distinguishing context embeddings according to skills, thereby equipping RL agents with the ability to identify and execute different skills across tasks. We then propose Skill-aware Noise Contrastive Estimation (SaNCE), a $K$-sample estimator used to optimise the SaMI objective. We provide a framework for equipping an RL agent with SaNCE in practice and conduct experimental validation on modified MuJoCo and Panda-gym benchmarks. We empirically find that RL agents that learn by maximising SaMI achieve substantially improved zero-shot generalisation to unseen tasks. Additionally, the context encoder trained with SaNCE demonstrates greater robustness to a reduction in the number of available samples, and thus has the potential to overcome the $\log$-$K$ curse.
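To make the $K$-sample estimation concrete: the abstract does not spell out SaNCE's form, so the sketch below shows a generic InfoNCE-style $K$-sample contrastive loss (not the paper's SaNCE itself; the function name and temperature parameter are illustrative assumptions). The estimator lower-bounds mutual information by $\log K$ minus the contrastive loss, which is why the bound saturates at $\log K$ and tightening it demands many samples, the $\log$-$K$ curse mentioned above.

```python
import math

def info_nce_loss(anchor, positive, negatives, temperature=0.1):
    """Generic InfoNCE K-sample contrastive loss (illustrative, not SaNCE).

    anchor, positive, negatives: plain lists of floats (embedding vectors).
    Returns -log softmax score of the positive among K = 1 + len(negatives)
    candidates; the MI lower bound is then log(K) - loss.
    """
    def cos(a, b):
        # Cosine similarity between two embedding vectors.
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(y * y for y in b))
        return dot / (na * nb)

    # Temperature-scaled similarities; the positive sits at index 0.
    logits = [cos(anchor, positive) / temperature]
    logits += [cos(anchor, n) / temperature for n in negatives]

    # Numerically stable log-sum-exp, then cross-entropy on index 0.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[0]
```

In a context-encoder setting, the anchor and positive would be embeddings of trajectories from the same task (or, under SaMI, the same skill), while the negatives come from other tasks; a well-trained encoder drives this loss toward zero, pushing the MI bound toward $\log K$.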