Selecting exploratory actions that generate a rich stream of experience for better learning is a fundamental challenge in reinforcement learning (RL). One approach to this problem is to select actions according to specific policies for an extended period of time; such temporally extended policies are known as options. A recent line of work derives such exploratory options from the eigenfunctions of the graph Laplacian. Importantly, until now these methods have been mostly limited to tabular domains where (1) the graph Laplacian matrix was either given or could be fully estimated, (2) performing eigendecomposition on this matrix was computationally tractable, and (3) value functions could be learned exactly. Additionally, these methods required a separate option discovery phase. These assumptions fundamentally do not scale. In this paper we address these limitations and show how recent results for directly approximating the eigenfunctions of the Laplacian can be leveraged to truly scale up options-based exploration. To do so, we introduce a fully online deep RL algorithm for discovering Laplacian-based options and evaluate our approach on a variety of pixel-based tasks. We compare against several state-of-the-art exploration methods and show that our approach is effective, general, and especially promising in non-stationary settings.