The ability to learn good representations of states is essential for solving large reinforcement learning problems, where exploration, generalization, and transfer are particularly challenging. The Laplacian representation is a promising approach to these problems: it induces informative state encodings and intrinsic rewards for temporally-extended action discovery and reward shaping. To obtain the Laplacian representation, one needs to compute the eigensystem of the graph Laplacian, which is often approximated through optimization objectives compatible with deep learning approaches. These approximations, however, depend on hyperparameters that are impossible to tune efficiently, converge to arbitrary rotations of the desired eigenvectors, and are unable to accurately recover the corresponding eigenvalues. In this paper, we introduce a theoretically sound objective and a corresponding optimization algorithm for approximating the Laplacian representation. Our approach naturally recovers both the true eigenvectors and eigenvalues while eliminating the hyperparameter dependence of previous approximations. We provide theoretical guarantees for our method and show that these results translate empirically into robust learning across multiple environments.
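To make concrete what "computing the eigensystem of the graph Laplacian" means, the following is a minimal sketch on a tiny tabular example. It is not the objective or algorithm introduced in this paper. It assumes the combinatorial Laplacian L = D - A of an undirected state graph (the paper's exact definition of the Laplacian may differ), and the 5-state chain and the function name `laplacian_representation` are hypothetical choices made only for illustration.

```python
import numpy as np

def laplacian_representation(adjacency, d):
    """Return the d smallest eigenvalues and matching eigenvectors of L = D - A.

    Assumption: `adjacency` is a symmetric matrix describing an undirected
    state graph, so L is symmetric and an exact eigendecomposition applies.
    """
    degree = np.diag(adjacency.sum(axis=1))
    laplacian = degree - adjacency
    # eigh is meant for symmetric matrices and returns eigenvalues in ascending order.
    eigenvalues, eigenvectors = np.linalg.eigh(laplacian)
    return eigenvalues[:d], eigenvectors[:, :d]

# Hypothetical environment: a 5-state chain in which neighbouring states are connected.
n_states = 5
adjacency = np.zeros((n_states, n_states))
for s in range(n_states - 1):
    adjacency[s, s + 1] = adjacency[s + 1, s] = 1.0

eigenvalues, representation = laplacian_representation(adjacency, d=3)
# Row s of `representation` is the 3-dimensional encoding of state s.
```

An exact decomposition like this needs the full state graph in memory, so it is only practical for small state spaces; this is the gap that the optimization-based approximations discussed above are meant to fill.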