We study the design of sample-efficient algorithms for reinforcement learning in the presence of rich, high-dimensional observations, formalized via the Block MDP problem. Existing algorithms suffer from 1) computational intractability, 2) strong statistical assumptions that are not necessarily satisfied in practice, or 3) suboptimal sample complexity. We address these issues by providing the first computationally efficient algorithm that attains rate-optimal sample complexity with respect to the desired accuracy level, with minimal statistical assumptions. Our algorithm, MusIK, combines systematic exploration with representation learning based on multi-step inverse kinematics, a learning objective in which the aim is to predict the learner's own action from the current observation and observations in the (potentially distant) future. MusIK is simple and flexible, and can efficiently take advantage of general-purpose function approximation. Our analysis leverages several new techniques tailored to non-optimistic exploration algorithms, which we anticipate will find broader use.
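To make the multi-step inverse kinematics objective concrete, the following is a minimal toy sketch, not the MusIK algorithm itself. Everything in it is an assumption made for illustration: the two-state latent dynamics (`s' = s XOR a`), the roll-out policy that always plays action 0 after the first step, the noisy observation map `emit`, and the count-based classifier standing in for general-purpose function approximation. It only shows the shape of the objective: predict the action taken at the current step from a decoded current observation and a decoded observation `k` steps later, scored by log-loss.

```python
import math
import random
from collections import Counter, defaultdict


def emit(state, rng):
    # Rich observation (hypothetical): the latent state plus irrelevant noise.
    return (state, rng.randrange(10))


def collect(n, k, rng):
    # Gather (x_now, action, x_future) triples; x_future is k steps ahead.
    data = []
    for _ in range(n):
        s = rng.randrange(2)
        x_now = emit(s, rng)
        a = rng.randrange(2)       # the action the learner must predict
        s ^= a                     # assumed toy dynamics: s' = s XOR a
        for _ in range(k - 1):
            s ^= 0                 # assumed roll-out policy: always action 0
        data.append((x_now, a, emit(s, rng)))
    return data


def fit_inverse_model(data, decoder):
    # Count-based estimate of P(a | decoder(x_now), decoder(x_future)).
    counts = defaultdict(Counter)
    for x_now, a, x_future in data:
        counts[(decoder(x_now), decoder(x_future))][a] += 1
    return counts


def log_loss(data, counts, decoder, n_actions=2):
    # Average negative log-likelihood with Laplace smoothing.
    total = 0.0
    for x_now, a, x_future in data:
        c = counts[(decoder(x_now), decoder(x_future))]
        p = (c[a] + 1) / (sum(c.values()) + n_actions)
        total -= math.log(p)
    return total / len(data)


def compare_decoders(n=2000, k=5, seed=0):
    rng = random.Random(seed)
    data = collect(n, k, rng)
    latent = lambda x: x[0]    # decoder that recovers the latent state
    blind = lambda x: 0        # decoder that discards the observation
    loss_latent = log_loss(data, fit_inverse_model(data, latent), latent)
    loss_blind = log_loss(data, fit_inverse_model(data, blind), blind)
    return loss_latent, loss_blind


if __name__ == "__main__":
    print(compare_decoders())
```

In this toy setting, a decoder that recovers the latent state lets the classifier predict the action almost perfectly, while a decoder that discards the observation can do no better than guessing. This is the sense in which the inverse kinematics loss can serve as a signal for representation learning; how MusIK chooses its roll-out policies and function classes is not shown here.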