We study the regret of reinforcement learning from offline data generated by a fixed behavior policy in an infinite-horizon discounted Markov decision process (MDP). While existing analyses of common approaches, such as fitted $Q$-iteration (FQI), suggest an $O(1/\sqrt{n})$ convergence rate for regret, empirical behavior exhibits \emph{much} faster convergence. In this paper, we present a finer regret analysis that exactly characterizes this phenomenon by providing fast rates for the regret convergence. First, we show that given any estimate of the optimal quality function $Q^*$, the regret of the policy it defines converges at a rate given by the exponentiation of the $Q^*$-estimate's pointwise convergence rate, thus speeding it up. The level of exponentiation depends on the level of noise in the \emph{decision-making} problem, rather than in the estimation problem. We establish such noise levels for linear and tabular MDPs as examples. Second, we provide new analyses of FQI and Bellman residual minimization to establish the correct pointwise convergence guarantees. As specific cases, our results imply $O(1/n)$ regret rates in linear cases and $\exp(-\Omega(n))$ regret rates in tabular cases. Finally, we extend our findings to general function approximation by deriving regret guarantees based on $L_p$-convergence rates for estimating $Q^*$ rather than pointwise rates; $L_2$ guarantees for nonparametric $Q^*$-estimation can be ensured under mild conditions.
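To illustrate what "exponentiation" of a rate can mean, here is a sketch under an assumed margin-type noise condition. The symbols $\hat{Q}_n$, $r_n$, $\Delta$, $\alpha$, and $\delta_0$ are illustrative notation that does not appear in the text above, and the exact condition and constants in the analysis may differ. Suppose the estimate $\hat{Q}_n$ satisfies $\sup_{s,a} |\hat{Q}_n(s,a) - Q^*(s,a)| = O(r_n)$, and the gap $\Delta(s)$ between the best and second-best values of $Q^*(s,\cdot)$ satisfies $\mathbb{P}(0 < \Delta(s) \le \delta) \le (\delta/\delta_0)^{\alpha}$ for some $\alpha \ge 0$. A bound of the following form would then be one way to express the speed-up:
\[
  \mathrm{Regret}(\hat{\pi}_n) = O\bigl(r_n^{\,1+\alpha}\bigr),
  \qquad \text{e.g. } r_n = n^{-1/2},\ \alpha = 1
  \;\Longrightarrow\; \mathrm{Regret}(\hat{\pi}_n) = O\bigl((n^{-1/2})^{2}\bigr) = O(1/n).
\]
Under this reading, $\alpha = 1$ is consistent with the $O(1/n)$ rate stated for linear cases, and letting $\alpha \to \infty$ (a gap bounded away from zero) is consistent with the much faster rates stated for tabular cases.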