A major challenge in reinforcement learning is to develop practical, sample-efficient algorithms for exploration in high-dimensional domains where generalization and function approximation are required. Low-Rank Markov Decision Processes -- where transition probabilities admit a low-rank factorization based on an unknown feature embedding -- offer a simple yet expressive framework for RL with function approximation, but existing algorithms are either (1) computationally intractable, or (2) reliant upon restrictive statistical assumptions such as latent variable structure, access to model-based function approximation, or reachability. In this work, we propose the first provably sample-efficient algorithm for exploration in Low-Rank MDPs that is both computationally efficient and model-free, allowing for general function approximation and requiring no additional structural assumptions. Our algorithm, VoX, uses a barycentric spanner for the feature embedding as an efficiently computable basis for exploration, and computes the spanner efficiently by interleaving representation learning and policy optimization. Our analysis -- which is appealingly simple and modular -- carefully combines several techniques, including a new approach to error-tolerant barycentric spanner computation and an improved analysis of a certain minimax representation learning objective found in prior work.
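To make the central primitive concrete: a set of points x_1, ..., x_d in S, a subset of R^d, is a C-approximate barycentric spanner of S if every x in S can be written as a linear combination of the x_i with coefficients in [-C, C]. The sketch below illustrates the classical swap-based routine of Awerbuch and Kleinberg (2004) for computing such a spanner over a finite point set. It is only a minimal illustration of the underlying primitive, not VoX itself, whose error-tolerant variant must work with approximately learned features and interleave spanner computation with policy optimization; the function name and parameters here are illustrative.

```python
import numpy as np

def approx_barycentric_spanner(points, C=2.0):
    """Compute a C-approximate barycentric spanner of a finite set of
    d-dimensional vectors via the Awerbuch-Kleinberg swap procedure.
    Assumes the rows of `points` span R^d.
    """
    points = np.asarray(points, dtype=float)
    n, d = points.shape
    # Phase 1: starting from the identity, replace each column in turn
    # with the point maximizing the absolute determinant, so that all
    # d columns end up being elements of the input set.
    X = np.eye(d)
    for i in range(d):
        dets = []
        for p in points:
            M = X.copy()
            M[:, i] = p
            dets.append(abs(np.linalg.det(M)))
        X[:, i] = points[int(np.argmax(dets))]
    # Phase 2: while some point, swapped into some column, grows the
    # absolute determinant by a factor greater than C, perform the swap.
    improved = True
    while improved:
        improved = False
        base = abs(np.linalg.det(X))
        for i in range(d):
            for p in points:
                M = X.copy()
                M[:, i] = p
                if abs(np.linalg.det(M)) > C * base:
                    X = M
                    base = abs(np.linalg.det(X))
                    improved = True
    return X.T  # rows are the spanner elements
```

Upon termination, no swap can grow the determinant by more than a factor of C, which by Cramer's rule guarantees that every point in the set is expressible over the returned rows with coefficients bounded by C in absolute value.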