In meta reinforcement learning (meta RL), an agent seeks a Bayes-optimal policy -- the optimal policy when facing an unknown task sampled from a known task distribution. Previous approaches tackled this problem by inferring a belief over task parameters using variational inference. Motivated by recent successes of contrastive learning in RL, such as contrastive predictive coding (CPC), we investigate whether contrastive methods can be used to learn Bayes-optimal behavior. We begin by proving that representations learned by CPC are indeed sufficient for Bayes optimality. Based on this result, we propose a simple meta RL algorithm that uses CPC in lieu of variational belief inference. Our method, ContraBAR, achieves performance comparable to the state of the art in domains with state-based observations and circumvents the computational cost of reconstructing future observations, enabling learning in domains with image-based observations. It can also be combined with image augmentations for domain randomization and used seamlessly in both online and offline meta RL settings.
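To make the contrastive objective concrete, the following is a minimal sketch of the generic InfoNCE loss that CPC optimizes, written for a single (history, future) pair. It is not ContraBAR's implementation: the function names `bilinear_score` and `info_nce_loss`, the bilinear form of the score, and the toy vectors are all assumptions chosen for illustration.

```python
import math


def bilinear_score(context, W, target):
    """Hypothetical score f(c, z) = c^T W z between a history embedding c
    and a candidate future embedding z (plain Python lists)."""
    proj = [
        sum(c * W[i][j] for i, c in enumerate(context))
        for j in range(len(target))
    ]
    return sum(p * t for p, t in zip(proj, target))


def info_nce_loss(pos_score, neg_scores):
    """InfoNCE loss for one pair: the negative log-probability of picking
    the true future among the true future plus the negatives."""
    scores = [pos_score] + list(neg_scores)
    m = max(scores)  # subtract the max for a numerically stable log-sum-exp
    log_norm = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_norm - pos_score


# Toy usage with made-up embeddings (assumed, not from the paper).
context = [1.0, 0.0]
W = [[1.0, 0.0], [0.0, 1.0]]
positive = [2.0, 0.0]
negatives = [[0.0, 1.0], [-1.0, 0.0]]

loss = info_nce_loss(
    bilinear_score(context, W, positive),
    [bilinear_score(context, W, z) for z in negatives],
)
print(loss)
```

Because the loss only requires scoring candidate futures against the history embedding, nothing in this sketch reconstructs an observation, which is the property the abstract credits for making image-based domains tractable.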