Policy Mirror Descent Inherently Explores Action Space

Explicit exploration of the action space has long been assumed indispensable for online policy gradient methods, which would otherwise suffer a drastic degradation in sample complexity when solving general reinforcement learning problems over finite state and action spaces. In this paper, we establish for the first time an $\tilde{\mathcal{O}}(1/\epsilon^2)$ sample complexity for online policy gradient methods without incorporating any exploration strategies. The essential development consists of two new on-policy evaluation operators and a novel analysis of the stochastic policy mirror descent (SPMD) method. SPMD with the first evaluation operator, called value-based estimation, is tailored to the Kullback-Leibler divergence. Provided the Markov chains on the state space induced by the generated policies are uniformly mixing with a non-diminishing minimal visitation measure, an $\tilde{\mathcal{O}}(1/\epsilon^2)$ sample complexity is obtained with a linear dependence on the size of the action space. SPMD with the second evaluation operator, namely truncated on-policy Monte Carlo (TOMC), attains an $\tilde{\mathcal{O}}(\mathcal{H}_{\mathcal{D}}/\epsilon^2)$ sample complexity, where $\mathcal{H}_{\mathcal{D}}$ depends only mildly on the effective horizon and the size of the action space when the Bregman divergence is properly chosen (e.g., the Tsallis divergence). SPMD with TOMC also exhibits stronger convergence properties in that it controls the optimality gap with high probability rather than in expectation. In contrast to explicit exploration, these new policy gradient methods can avoid repeatedly committing to potentially high-risk actions when searching for optimal policies.
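For readers unfamiliar with policy mirror descent, the following is a minimal sketch of a single mirror descent step at one state when the Bregman divergence is the Kullback-Leibler divergence. In that case the step reduces to the standard multiplicative-weights form $\pi_{k+1}(a|s) \propto \pi_k(a|s)\exp(-\eta\, Q(s,a))$. The function name `kl_pmd_step`, the argument names, and the convention that `q_values` are costs to be minimized are assumptions made for illustration; the sketch does not implement the value-based estimation or TOMC operators, and `q_values` stands in for whatever estimate such an operator would supply.

```python
import math

def kl_pmd_step(policy, q_values, step_size):
    """One KL-divergence policy mirror descent step at a single state.

    policy:    action probabilities pi_k(.|s), summing to 1 (hypothetical input)
    q_values:  estimated action values Q(s, .), treated here as costs (assumption)
    step_size: positive step size eta
    Returns pi_{k+1}(.|s), proportional to pi_k(a|s) * exp(-step_size * Q(s, a)).
    """
    # Work in log space; actions with zero probability stay at zero.
    logits = [math.log(p) - step_size * q if p > 0 else -math.inf
              for p, q in zip(policy, q_values)]
    # Subtract the largest logit before exponentiating for numerical stability.
    largest = max(logits)
    weights = [math.exp(l - largest) for l in logits]
    total = sum(weights)
    return [w / total for w in weights]

# Example: start from the uniform policy over four actions.
uniform = [0.25, 0.25, 0.25, 0.25]
estimated_costs = [1.0, 0.5, 2.0, 0.5]
updated = kl_pmd_step(uniform, estimated_costs, step_size=1.0)
```

In this sketch, actions with lower estimated cost gain probability mass and those with higher cost lose it, while every action that starts with positive probability keeps positive probability after the step. If rewards rather than costs are used, the sign inside the exponent flips.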
