Mastering multiple tasks through exploration and learning in an environment poses a significant challenge in reinforcement learning (RL). Unsupervised RL has been introduced to address this challenge by training policies with intrinsic rewards rather than extrinsic rewards. However, current intrinsic reward designs and unsupervised RL algorithms often overlook the heterogeneous nature of collected samples, thereby diminishing their sample efficiency. To overcome this limitation, in this paper, we propose a reward-free RL algorithm called GFA-RFE. The key idea behind our algorithm is an uncertainty-aware intrinsic reward for exploring the environment, together with an uncertainty-weighted learning process to handle the heterogeneous uncertainty across samples. Theoretically, we show that, to find an $\epsilon$-optimal policy, GFA-RFE needs to collect $\tilde{O}\big(H^2 \log N_{\mathcal F}(\epsilon) \, \mathrm{dim}(\mathcal F) / \epsilon^2\big)$ episodes, where $\mathcal F$ is the value function class with covering number $N_{\mathcal F}(\epsilon)$ and generalized eluder dimension $\mathrm{dim}(\mathcal F)$. This sample complexity improves upon that of all existing reward-free RL algorithms. We further implement and evaluate GFA-RFE across various domains and tasks in the DeepMind Control Suite. Experimental results show that GFA-RFE outperforms, or performs comparably to, state-of-the-art unsupervised RL algorithms.
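To make the key idea concrete, below is a minimal, illustrative Python sketch, not the paper's actual implementation: it assumes an ensemble of linear value functions whose disagreement supplies both the intrinsic exploration reward and the per-sample weight. All names, the ensemble-based uncertainty estimate, and the specific inverse-uncertainty weighting rule are our assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: an ensemble of randomly initialized linear value
# functions over state features stands in for the function class F.
# Ensemble disagreement serves as the uncertainty estimate; the paper's
# exact bonus construction may differ.
N_ENSEMBLE, DIM, LR = 8, 16, 0.1
ensemble = [rng.normal(size=DIM) for _ in range(N_ENSEMBLE)]

def uncertainty(phi):
    """Std of ensemble predictions at feature vector phi (disagreement)."""
    preds = np.array([w @ phi for w in ensemble])
    return preds.std()

def intrinsic_reward(phi):
    # Exploration phase: reward states where the value estimate is uncertain.
    return uncertainty(phi)

def weighted_update(phi, target):
    # Learning phase: downweight samples whose value estimates carry high
    # uncertainty (an uncertainty-weighted regression step).
    w_sample = 1.0 / (1.0 + uncertainty(phi)) ** 2
    for w in ensemble:
        w += LR * w_sample * (target - w @ phi) * phi  # in-place update

# Toy usage: score candidate states by novelty, then fit a target value.
states = rng.normal(size=(5, DIM))
most_novel = max(states, key=intrinsic_reward)
weighted_update(most_novel, target=1.0)
print(f"uncertainty after update: {uncertainty(most_novel):.3f}")
```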