We study matrix estimation problems arising in reinforcement learning (RL) with low-rank structure. In low-rank bandits, the matrix to be recovered specifies the expected arm rewards; in low-rank Markov Decision Processes (MDPs), it may, for example, characterize the transition kernel of the MDP. In both cases, each entry of the matrix carries important information, and we seek estimation methods with low entry-wise error. Importantly, these methods must also accommodate inherent correlations in the available data (e.g., for MDPs, the data consists of system trajectories). We investigate the performance of simple spectral-based matrix estimation approaches: we show that they efficiently recover the singular subspaces of the matrix and exhibit nearly minimal entry-wise error. These new results on low-rank matrix estimation make it possible to devise reinforcement learning algorithms that fully exploit the underlying low-rank structure. We provide two examples of such algorithms: a regret minimization algorithm for low-rank bandit problems, and a best policy identification algorithm for reward-free RL in low-rank MDPs. Both algorithms yield state-of-the-art performance guarantees.
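To make the idea of a spectral-based estimator concrete, here is a minimal sketch of one common variant: a truncated SVD of a noisy observation matrix. It is an illustration under stated assumptions, not the paper's estimator. The function names `spectral_estimate` and `make_example`, the matrix sizes, the rank, and the noise level are all hypothetical, and the noise is drawn i.i.d., which ignores the correlated, trajectory-based data that the abstract emphasizes.

```python
import numpy as np


def spectral_estimate(M_obs, rank):
    """Return the rank-`rank` truncated SVD of M_obs and its singular subspaces.

    Hypothetical helper for illustration; not taken from the paper.
    """
    U, s, Vt = np.linalg.svd(M_obs, full_matrices=False)
    U_r, s_r, Vt_r = U[:, :rank], s[:rank], Vt[:rank, :]
    # Scale the columns of U_r by the singular values, then recombine.
    M_hat = (U_r * s_r) @ Vt_r
    return M_hat, U_r, Vt_r.T


def make_example(n=60, m=50, rank=3, noise=0.1, seed=0):
    """Build a synthetic low-rank matrix and a noisy observation of it.

    Assumption: i.i.d. Gaussian noise, unlike the correlated data in RL.
    """
    rng = np.random.default_rng(seed)
    A = rng.normal(size=(n, rank))
    B = rng.normal(size=(m, rank))
    M = A @ B.T
    M_obs = M + noise * rng.normal(size=(n, m))
    return M, M_obs


M, M_obs = make_example()
M_hat, U_hat, V_hat = spectral_estimate(M_obs, rank=3)

# Entry-wise (max-norm) error is the quantity the abstract focuses on.
print("max entry-wise error, raw observation:", np.max(np.abs(M_obs - M)))
print("max entry-wise error, spectral estimate:", np.max(np.abs(M_hat - M)))
```

The sketch reports the maximum absolute entry error rather than only the Frobenius error, since the abstract stresses that every entry of the matrix matters.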