Data efficiency is a central issue in online reinforcement learning (RL). While a number of recent works have achieved asymptotically minimal regret in online RL, the optimality of these results is guaranteed only in a ``large-sample'' regime, so their algorithms incur an enormous burn-in cost before they operate optimally. How to achieve minimax-optimal regret without incurring any burn-in cost has been an open problem in RL theory. We settle this problem for finite-horizon inhomogeneous Markov decision processes. Specifically, we prove that a modified version of Monotonic Value Propagation (MVP), a model-based algorithm proposed by \cite{zhang2020reinforcement}, achieves regret on the order of (up to logarithmic factors)
\begin{equation*}
\min\big\{ \sqrt{SAH^3K}, \,HK \big\},
\end{equation*}
where $S$ is the number of states, $A$ is the number of actions, $H$ is the planning horizon, and $K$ is the total number of episodes. This regret matches the minimax lower bound over the entire range of sample sizes $K\geq 1$, essentially eliminating any burn-in requirement. It also translates to a PAC sample complexity (i.e., the number of episodes needed to yield $\varepsilon$-accuracy) of $\frac{SAH^3}{\varepsilon^2}$ up to logarithmic factors, which is minimax-optimal for the full $\varepsilon$-range. Further, we extend our theory to reveal the influence of problem-dependent quantities such as the optimal value/cost and certain variances. The key technical innovation is a new regret decomposition strategy, together with a novel analysis paradigm for decoupling complicated statistical dependency, a long-standing challenge in the analysis of online RL in the sample-hungry regime.
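To make the two regimes of the bound concrete, the following is a minimal numerical sketch rather than anything taken from the paper. It evaluates $\min\{\sqrt{SAH^3K},\,HK\}$ with all constants and logarithmic factors dropped (an assumption made purely for illustration), locates the point $K = SAH$ where the two terms coincide, and checks that $K = SAH^3/\varepsilon^2$ episodes bring the average per-episode regret down to $\varepsilon$. The function names and the example values of $S$, $A$, $H$ are hypothetical.

```python
import math


def regret_bound(S, A, H, K):
    """Order of the regret bound min{sqrt(S*A*H^3*K), H*K}.

    Constants and log factors are dropped (illustrative assumption).
    """
    return min(math.sqrt(S * A * H**3 * K), H * K)


def crossover_episodes(S, A, H):
    """Value of K at which sqrt(S*A*H^3*K) equals H*K, namely K = S*A*H."""
    return S * A * H


def pac_episodes(S, A, H, eps):
    """Order of the PAC sample complexity S*A*H^3 / eps^2 (log factors dropped)."""
    return S * A * H**3 / eps**2


if __name__ == "__main__":
    S, A, H = 10, 5, 20  # hypothetical problem sizes
    for K in (100, crossover_episodes(S, A, H), 10_000):
        print(K, regret_bound(S, A, H, K))
    K_eps = pac_episodes(S, A, H, eps=0.5)
    # Average regret per episode after K_eps episodes is about eps.
    print(K_eps, regret_bound(S, A, H, K_eps) / K_eps)
```

Below $K = SAH$ the trivial term $HK$ is the smaller one, and above it the $\sqrt{SAH^3K}$ term takes over; dividing the latter by $K$ and setting the result to $\varepsilon$ recovers the $SAH^3/\varepsilon^2$ episode count.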