We initiate the mathematical study of replicability as an algorithmic property in the context of reinforcement learning (RL). We focus on the fundamental setting of discounted tabular MDPs with access to a generative model. Inspired by Impagliazzo et al. [2022], we say that an RL algorithm is replicable if, with high probability, two executions on i.i.d. samples drawn from the generator output exactly the same policy when they share the same internal randomness. We first provide an efficient $\rho$-replicable algorithm for $(\varepsilon, \delta)$-optimal policy estimation with sample and time complexity $\widetilde O\left(\frac{N^3\cdot\log(1/\delta)}{(1-\gamma)^5\cdot\varepsilon^2\cdot\rho^2}\right)$, where $N$ is the number of state-action pairs. Next, for the subclass of deterministic algorithms, we provide a lower bound of order $\Omega\left(\frac{N^3}{(1-\gamma)^3\cdot\varepsilon^2\cdot\rho^2}\right)$. Then, we study a relaxed version of replicability proposed by Kalavasis et al. [2023], called TV indistinguishability. We design a computationally efficient TV indistinguishable algorithm for policy estimation whose sample complexity is $\widetilde O\left(\frac{N^2\cdot\log(1/\delta)}{(1-\gamma)^5\cdot\varepsilon^2\cdot\rho^2}\right)$. At the cost of $\exp(N)$ running time, we transform these TV indistinguishable algorithms into $\rho$-replicable ones without increasing their sample complexity. Finally, we introduce the notion of approximate replicability, where we only require that the two output policies are close under an appropriate statistical divergence (e.g., Rényi), and show an improved sample complexity of $\widetilde O\left(\frac{N\cdot\log(1/\delta)}{(1-\gamma)^5\cdot\varepsilon^2\cdot\rho^2}\right)$.
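The replicability requirement can be illustrated on a much simpler task than policy estimation. The following is a minimal sketch, assuming the randomly shifted grid rounding commonly used for replicable mean estimation; it is not the algorithm of this work. The names `replicable_round`, `replicable_mean`, `width`, and `seed` are hypothetical and chosen only for illustration. Two runs that see different i.i.d. samples but share the same `seed` (the internal randomness) return exactly the same number unless a rounding boundary happens to fall between their empirical means.

```python
import random


def replicable_round(value, width, seed):
    """Round `value` to a grid of spacing `width` with a random shared offset.

    The offset depends only on `seed`, which plays the role of the
    algorithm's internal randomness (an assumption of this sketch).
    """
    rng = random.Random(seed)         # shared internal randomness
    offset = rng.uniform(0.0, width)  # random shift of the rounding grid
    k = round((value - offset) / width)
    return offset + k * width         # nearest point of the shifted grid


def replicable_mean(samples, width, seed):
    """Empirical mean of `samples`, snapped to the shared random grid."""
    return replicable_round(sum(samples) / len(samples), width, seed)


if __name__ == "__main__":
    data_rng = random.Random(12345)   # stands in for the sample generator
    run_1 = [data_rng.gauss(0.5, 1.0) for _ in range(100_000)]
    run_2 = [data_rng.gauss(0.5, 1.0) for _ in range(100_000)]
    # Same seed for both runs, fresh samples for each run.
    print(replicable_mean(run_1, width=0.1, seed=7))
    print(replicable_mean(run_2, width=0.1, seed=7))
```

In this sketch the two outputs disagree only when a grid midpoint lies between the two empirical means, which for a uniformly random offset happens with probability equal to their distance divided by `width`. A coarser grid therefore improves replicability at the cost of accuracy, which gives some intuition for why the sample complexities above depend on $\rho$.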