Policy Mirror Descent Inherently Explores Action Space

Designing computationally efficient exploration strategies for on-policy first-order methods that attain the optimal $\mathcal{O}(1/\epsilon^2)$ sample complexity remains an open problem in solving Markov decision processes (MDPs). This manuscript provides an answer to this question from the perspective of simplicity, by showing that whenever exploration over the state space is implied by the MDP structure, there seems to be little need for sophisticated exploration strategies. We revisit a stochastic policy gradient method, named stochastic policy mirror descent (SPMD), applied to the infinite-horizon, discounted MDP with finite state and action spaces. Alongside SPMD we present two on-policy evaluation operators, both of which simply follow the policy to collect trajectories, with no explicit exploration or any other form of intervention. SPMD with the first evaluation operator, named value-based estimation, is tailored to the Kullback-Leibler (KL) divergence. Provided the Markov chains on the state space induced by the generated policies are uniformly mixing with non-diminishing minimal visitation measure, it attains an $\tilde{\mathcal{O}}(1/\epsilon^2)$ sample complexity with a linear dependence on the size of the action space. SPMD with the second evaluation operator, named truncated on-policy Monte Carlo, attains an $\tilde{\mathcal{O}}(\mathcal{H}_{\mathcal{D}}/\epsilon^2)$ sample complexity under the same assumption on the state chains of the generated policies. We characterize $\mathcal{H}_{\mathcal{D}}$ as a divergence-dependent function of the effective horizon and the size of the action space, which leads to an exponential dependence on these two quantities for the KL divergence, and a polynomial dependence for the divergence induced by the negative Tsallis entropy. These sample complexities appear to be new among on-policy stochastic policy gradient methods without explicit exploration.
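
To make the ingredients above concrete, the following is a minimal sketch, not the authors' exact algorithm, of a KL-divergence mirror descent step combined with a truncated Monte Carlo evaluation that only follows the current policy. All function names (`kl_pmd_step`, `truncated_on_policy_mc`, `run_spmd`), the cost-minimizing convention, the parameter defaults, and the pessimistic default assigned to unvisited state-action pairs are assumptions made for illustration; none of them are taken from the source.

```python
import numpy as np


def kl_pmd_step(policy, q_hat, step_size):
    """One mirror descent step with the KL divergence (cost-minimizing convention).

    policy: (S, A) action probabilities; q_hat: (S, A) estimated state-action costs.
    Returns the policy proportional to policy * exp(-step_size * q_hat), per state.
    """
    # Floor the probabilities so the log stays finite after many updates.
    logits = np.log(np.maximum(policy, 1e-300)) - step_size * q_hat
    logits -= logits.max(axis=1, keepdims=True)  # stabilize the exponentials
    new_policy = np.exp(logits)
    return new_policy / new_policy.sum(axis=1, keepdims=True)


def truncated_on_policy_mc(P, cost, policy, gamma, traj_len, trunc, rng):
    """Estimate Q from one trajectory that simply follows `policy`.

    P: (S, A, S) transition probabilities; cost: (S, A) per-step costs.
    The first visit to each (s, a) within the first `traj_len` steps is scored
    by the discounted cost over the next `trunc` steps. Unvisited pairs keep a
    pessimistic default of max cost / (1 - gamma); this default is an
    assumption of the sketch, not a detail taken from the source.
    """
    n_states, n_actions = cost.shape
    total = traj_len + trunc
    states = np.empty(total, dtype=int)
    actions = np.empty(total, dtype=int)
    s = rng.integers(n_states)
    for t in range(total):
        a = rng.choice(n_actions, p=policy[s])  # on-policy action, no intervention
        states[t], actions[t] = s, a
        s = rng.choice(n_states, p=P[s, a])
    step_costs = cost[states, actions]
    discounts = gamma ** np.arange(trunc)
    q_hat = np.full((n_states, n_actions), cost.max() / (1.0 - gamma))
    seen = np.zeros((n_states, n_actions), dtype=bool)
    for t in range(traj_len):
        s, a = states[t], actions[t]
        if not seen[s, a]:
            seen[s, a] = True
            q_hat[s, a] = discounts @ step_costs[t:t + trunc]
    return q_hat


def run_spmd(P, cost, gamma=0.9, iters=50, step_size=0.5,
             traj_len=200, trunc=30, seed=0):
    """Alternate on-policy evaluation and KL mirror descent updates."""
    rng = np.random.default_rng(seed)
    n_states, n_actions = cost.shape
    policy = np.full((n_states, n_actions), 1.0 / n_actions)  # uniform start
    for _ in range(iters):
        q_hat = truncated_on_policy_mc(P, cost, policy, gamma, traj_len, trunc, rng)
        policy = kl_pmd_step(policy, q_hat, step_size)
    return policy
```

In this sketch the only source of action-space coverage is the randomness of the policy itself: an action that is never sampled along the trajectory receives no Monte Carlo estimate, which is the setting the abstract describes as having no explicit exploration.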
