We initiate the mathematical study of replicability as an algorithmic property in the context of reinforcement learning (RL). We focus on the fundamental setting of discounted tabular MDPs with access to a generative model. Inspired by Impagliazzo et al. [2022], we say that an RL algorithm is replicable if, with high probability, two executions on i.i.d. samples drawn from the generator output exactly the same policy when they share the same internal randomness. We first provide an efficient $\rho$-replicable algorithm for $(\varepsilon, \delta)$-optimal policy estimation with sample and time complexity $\widetilde O\left(\frac{N^3\cdot\log(1/\delta)}{(1-\gamma)^5\cdot\varepsilon^2\cdot\rho^2}\right)$, where $N$ is the number of state-action pairs. Next, for the subclass of deterministic algorithms, we provide a lower bound of order $\Omega\left(\frac{N^3}{(1-\gamma)^3\cdot\varepsilon^2\cdot\rho^2}\right)$. Then, we study a relaxed version of replicability proposed by Kalavasis et al. [2023] called TV indistinguishability. We design a computationally efficient TV indistinguishable algorithm for policy estimation whose sample complexity is $\widetilde O\left(\frac{N^2\cdot\log(1/\delta)}{(1-\gamma)^5\cdot\varepsilon^2\cdot\rho^2}\right)$. At the cost of $\exp(N)$ running time, we transform these TV indistinguishable algorithms into $\rho$-replicable ones without increasing their sample complexity. Finally, we introduce the notion of approximate replicability, where we only require that the two output policies are close under an appropriate statistical divergence (e.g., Rényi), and show an improved sample complexity of $\widetilde O\left(\frac{N\cdot\log(1/\delta)}{(1-\gamma)^5\cdot\varepsilon^2\cdot\rho^2}\right)$.
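To make the replicability definition concrete, the following is a minimal sketch on a toy problem (estimating a Bernoulli mean), not the RL algorithm of this paper. It assumes a common device from the replicability literature: round the empirical estimate to a grid whose offset is drawn from the shared internal randomness. All names here (`replicable_mean`, `draw_samples`, `agreement_rate`, `grid_width`, `shared_seed`) are hypothetical and chosen only for illustration.

```python
import math
import random


def replicable_mean(samples, grid_width, shared_seed):
    """Round the empirical mean to a grid with a randomly shifted origin.

    The shift comes from `shared_seed`, which plays the role of the
    internal randomness shared by the two executions.
    """
    rng = random.Random(shared_seed)
    offset = rng.uniform(0.0, grid_width)
    mean = sum(samples) / len(samples)
    # Index of the grid cell [offset + k*w, offset + (k+1)*w) containing the mean.
    k = math.floor((mean - offset) / grid_width)
    # Output the centre of that cell; two runs agree iff they land in the same cell.
    return offset + (k + 0.5) * grid_width


def draw_samples(n, data_seed, p=0.3):
    """Draw n i.i.d. Bernoulli(p) samples; stands in for calls to the generator."""
    rng = random.Random(data_seed)
    return [1.0 if rng.random() < p else 0.0 for _ in range(n)]


def agreement_rate(n, grid_width, trials=100):
    """Fraction of trials in which two runs on fresh data give identical outputs."""
    agree = 0
    for t in range(trials):
        first = draw_samples(n, data_seed=2 * t)
        second = draw_samples(n, data_seed=2 * t + 1)
        out_first = replicable_mean(first, grid_width, shared_seed=t)
        out_second = replicable_mean(second, grid_width, shared_seed=t)
        if out_first == out_second:
            agree += 1
    return agree / trials


if __name__ == "__main__":
    print(agreement_rate(n=5000, grid_width=0.2))
```

In this sketch the two runs disagree only when a cell boundary falls between their empirical means, so a wider grid or more samples raises the agreement rate, at the price of a coarser estimate. This is the kind of trade-off between accuracy $\varepsilon$ and replicability $\rho$ that the sample complexity bounds above quantify.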