Owing to its power in extracting feature representations, contrastive self-supervised learning has been successfully integrated into the practice of (deep) reinforcement learning (RL), leading to efficient policy learning in various applications. Despite these tremendous empirical successes, a theoretical understanding of contrastive learning for RL remains elusive. To narrow this gap, we study how RL can be empowered by contrastive learning in a class of Markov decision processes (MDPs) and Markov games (MGs) with low-rank transitions. For both models, we propose to extract the correct feature representations of the low-rank model by minimizing a contrastive loss. Moreover, in the online setting, we propose novel upper confidence bound (UCB)-type algorithms that combine this contrastive loss with online RL algorithms for MDPs and MGs. We further prove theoretically that our algorithms recover the true representations and simultaneously achieve sample efficiency in learning the optimal policy in MDPs and the Nash equilibrium in MGs. We also provide empirical studies to demonstrate the efficacy of the UCB-based contrastive learning method for RL. To the best of our knowledge, we provide the first provably efficient online RL algorithm that incorporates contrastive learning for representation learning. Our code is available at https://github.com/Baichenjia/Contrastive-UCB.
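To make the representation-learning step concrete, below is a minimal sketch of the kind of NCE-style contrastive loss the abstract refers to, assuming the low-rank structure P(s'|s,a) = <phi(s,a), mu(s')>: observed transitions serve as positive pairs, and next states drawn from a noise distribution serve as negatives. The class and function names, network widths, and the continuous vector encodings of states and actions are illustrative assumptions, not the released implementation at the repository linked above.

```python
import torch
import torch.nn as nn


class ContrastiveRepresentation(nn.Module):
    """Encoders phi(s, a) and mu(s') whose inner product scores a transition."""

    def __init__(self, state_dim: int, action_dim: int, feature_dim: int):
        super().__init__()
        # phi: encoder for the (state, action) pair (widths are assumptions).
        self.phi = nn.Sequential(
            nn.Linear(state_dim + action_dim, 256), nn.ReLU(),
            nn.Linear(256, feature_dim),
        )
        # mu: encoder for the next state.
        self.mu = nn.Sequential(
            nn.Linear(state_dim, 256), nn.ReLU(),
            nn.Linear(256, feature_dim),
        )

    def score(self, s, a, s_next):
        # Logit f(s, a, s') = <phi(s, a), mu(s')>, matching the low-rank form.
        return (self.phi(torch.cat([s, a], dim=-1)) * self.mu(s_next)).sum(-1)


def contrastive_loss(model, s, a, s_pos, s_neg):
    """Binary NCE: classify true next states (label 1) against noise states (label 0)."""
    bce = nn.functional.binary_cross_entropy_with_logits
    pos_logits = model.score(s, a, s_pos)  # observed transitions
    neg_logits = model.score(s, a, s_neg)  # noise next states
    return (bce(pos_logits, torch.ones_like(pos_logits))
            + bce(neg_logits, torch.zeros_like(neg_logits)))
```

In a UCB-type algorithm of the kind described above, the learned phi(s, a) would then feed the usual optimistic exploration bonus during planning; that step is omitted from this sketch.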