Selecting exploratory actions that generate a rich stream of experience for better learning is a fundamental challenge in reinforcement learning (RL). One approach to this problem is to select actions according to specific policies for an extended period of time; such temporally extended behaviors are known as options. A recent line of work derives exploratory options from the eigenfunctions of the graph Laplacian. Importantly, until now these methods have been mostly limited to tabular domains where (1) the graph Laplacian matrix was either given or could be fully estimated, (2) performing eigendecomposition on this matrix was computationally tractable, and (3) value functions could be learned exactly. Additionally, these methods required a separate option discovery phase. These assumptions fundamentally do not scale. In this paper we address these limitations and show how recent results for directly approximating the eigenfunctions of the Laplacian can be leveraged to truly scale up options-based exploration. To do so, we introduce a fully online deep RL algorithm for discovering Laplacian-based options and evaluate our approach on a variety of pixel-based tasks. We compare against several state-of-the-art exploration methods and show that our approach is effective, general, and especially promising in non-stationary settings.
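To make the tabular setting concrete, the following is a minimal sketch of assumptions (1) and (2): the adjacency structure of a small state graph is fully known, and the Laplacian is decomposed exactly. It is not the algorithm introduced in this paper. The corridor environment, the function names, the use of the unnormalized Laplacian L = D - A, and the reward form e[s'] - e[s] (one common choice in the eigenfunction-based options literature) are all illustrative assumptions.

```python
import numpy as np


def grid_adjacency(rows, cols):
    """Adjacency matrix of a 4-connected grid; state index = r * cols + c.

    Hypothetical toy environment, used only for illustration.
    """
    n = rows * cols
    A = np.zeros((n, n))
    for r in range(rows):
        for c in range(cols):
            s = r * cols + c
            if c + 1 < cols:  # edge to the right-hand neighbor
                A[s, s + 1] = A[s + 1, s] = 1.0
            if r + 1 < rows:  # edge to the neighbor below
                A[s, s + cols] = A[s + cols, s] = 1.0
    return A


def laplacian_eigenvectors(A):
    """Exact eigendecomposition of the unnormalized Laplacian L = D - A."""
    D = np.diag(A.sum(axis=1))
    L = D - A
    # eigh handles symmetric matrices and returns eigenvalues in ascending order.
    eigvals, eigvecs = np.linalg.eigh(L)
    return eigvals, eigvecs


def intrinsic_reward(e, s, s_next):
    """Assumed reward for an option that follows eigenvector e: e[s'] - e[s]."""
    return e[s_next] - e[s]


A = grid_adjacency(1, 5)  # a 5-state corridor
eigvals, eigvecs = laplacian_eigenvectors(A)

# The first eigenvector is constant (eigenvalue 0) and carries no direction;
# the second varies monotonically along the corridor. Its sign is arbitrary.
e = eigvecs[:, 1]
goal_state = int(np.argmax(e))  # where an option maximizing this reward ends up
```

In this sketch, an option trained to maximize `intrinsic_reward` is pushed toward `goal_state`, one end of the corridor. Both steps rely on enumerating every state and storing an n-by-n matrix, which is exactly what stops working once states are images.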