MDPs with low-rank transitions (that is, MDPs whose transition matrix can be factored into the product of a left matrix and a right matrix) are a highly representative structure that enables tractable learning. The left matrix enables expressive function approximation for value-based learning and has been studied extensively. In this work, we instead investigate sample-efficient learning with density features, i.e., the right matrix, which induce powerful models for state-occupancy distributions. This setting not only sheds light on leveraging unsupervised learning in RL, but also enables plug-in solutions for convex RL. In the offline setting, we propose an algorithm for off-policy estimation of occupancies that can handle non-exploratory data. Using this as a subroutine, we further devise an online algorithm that constructs exploratory data distributions in a level-by-level manner. A central technical challenge is that the additive error of occupancy estimation is incompatible with the multiplicative definition of data coverage. In the absence of strong assumptions like reachability, this incompatibility easily leads to exponential error blow-up, which we overcome via novel technical tools. Our results also readily extend to the representation learning setting, where the density features are unknown and must be learned from an exponentially large candidate set.
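To make the role of the right matrix concrete, the following is a minimal sketch, not the paper's algorithm, of why density features model state occupancies. It assumes a small tabular MDP in which each left feature `phi[s][a]` lies on the probability simplex and each density feature `mu[i]` is a distribution over next states; this is a convenient special case chosen so that the product is a valid transition matrix, and it is not required by the general low-rank definition. All names (`build_low_rank_mdp`, `theta`, and so on) are hypothetical. Under the factorization P(s' | s, a) = Σ_i phi[s][a][i] · mu[i][s'], the next-level occupancy is a linear combination of the density features, with a weight vector `theta` of length equal to the rank.

```python
import random


def random_simplex(n, rng):
    """Return a random probability vector of length n."""
    weights = [rng.random() + 1e-9 for _ in range(n)]
    total = sum(weights)
    return [w / total for w in weights]


def build_low_rank_mdp(n_states, n_actions, rank, seed=0):
    """Hypothetical toy construction (an assumption, not from the paper)."""
    rng = random.Random(seed)
    # Left matrix: phi[s][a] is a mixture over `rank` latent factors.
    phi = [[random_simplex(rank, rng) for _ in range(n_actions)]
           for _ in range(n_states)]
    # Right matrix (density features): mu[i] is a distribution over next states.
    mu = [random_simplex(n_states, rng) for _ in range(rank)]
    return phi, mu


def transition(phi, mu, s, a):
    """P(. | s, a) = sum_i phi[s][a][i] * mu[i][.]"""
    return [sum(phi[s][a][i] * mu[i][s2] for i in range(len(mu)))
            for s2 in range(len(mu[0]))]


def next_occupancy_direct(phi, mu, occupancy, policy):
    """Propagate the state occupancy one level through the full transition."""
    out = [0.0] * len(occupancy)
    for s, d_s in enumerate(occupancy):
        for a, pi_a in enumerate(policy[s]):
            for s2, p in enumerate(transition(phi, mu, s, a)):
                out[s2] += d_s * pi_a * p
    return out


def next_occupancy_from_density_features(phi, mu, occupancy, policy):
    """Same quantity, written as a linear combination of density features."""
    rank = len(mu)
    theta = [0.0] * rank  # theta[i] = E_{s ~ d, a ~ pi}[phi[s][a][i]]
    for s, d_s in enumerate(occupancy):
        for a, pi_a in enumerate(policy[s]):
            for i in range(rank):
                theta[i] += d_s * pi_a * phi[s][a][i]
    occ = [sum(theta[i] * mu[i][s2] for i in range(rank))
           for s2 in range(len(mu[0]))]
    return occ, theta


N_STATES, N_ACTIONS, RANK = 6, 3, 2
phi, mu = build_low_rank_mdp(N_STATES, N_ACTIONS, RANK)
uniform_policy = [[1.0 / N_ACTIONS] * N_ACTIONS for _ in range(N_STATES)]
initial_occupancy = [1.0] + [0.0] * (N_STATES - 1)  # start in state 0

direct = next_occupancy_direct(phi, mu, initial_occupancy, uniform_policy)
via_mu, theta = next_occupancy_from_density_features(
    phi, mu, initial_occupancy, uniform_policy)
max_gap = max(abs(x - y) for x, y in zip(direct, via_mu))
```

In this toy setup the two computations agree up to floating-point error, and the occupancy over six states is described by only `RANK = 2` coefficients. This is the sense in which the right matrix induces a compact model class for occupancies; estimating those coefficients from non-exploratory data is the part the abstract's algorithms address and is not attempted here.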