This paper studies reward-agnostic exploration in reinforcement learning (RL) -- a scenario where the learner is unaware of the reward functions during the exploration stage -- and designs an algorithm that improves over the state of the art. More precisely, consider a finite-horizon inhomogeneous Markov decision process with $S$ states, $A$ actions, and horizon length $H$, and suppose that there are no more than a polynomial number of given reward functions of interest. By collecting an order of \begin{align*} \frac{SAH^3}{\varepsilon^2} \text{ sample episodes (up to logarithmic factors)} \end{align*} without any guidance from reward information, our algorithm is able to find $\varepsilon$-optimal policies for all these reward functions, provided that $\varepsilon$ is sufficiently small. This forms the first reward-agnostic exploration scheme in this context that achieves provable minimax optimality. Furthermore, once the sample size exceeds $\frac{S^2AH^3}{\varepsilon^2}$ episodes (up to logarithmic factors), our algorithm is able to yield $\varepsilon$-accuracy for arbitrarily many reward functions (even when they are adversarially designed), a task commonly dubbed ``reward-free exploration.'' The novelty of our algorithm design draws on insights from offline RL: the exploration scheme attempts to maximize a critical reward-agnostic quantity that dictates the performance of offline RL, while the policy learning phase leverages ideas from sample-optimal offline RL algorithms.
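To make the guarantee concrete, the $\varepsilon$-optimality requirement above can be read as follows; this is a sketch in standard notation not fixed by the abstract, writing $\mathcal{R}$ for the given collection of reward functions, $V_1^{\pi}(s_1; r)$ for the value of policy $\pi$ under reward $r$ from the initial state, and $\widehat{\pi}_r$ for the policy our algorithm outputs for reward $r$:
\begin{align*}
	\max_{\pi}\, V_1^{\pi}(s_1; r) \;-\; V_1^{\widehat{\pi}_r}(s_1; r) \;\leq\; \varepsilon \qquad \text{for every } r \in \mathcal{R}.
\end{align*}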