In a broad class of reinforcement learning (RL) applications, stochastic rewards have heavy-tailed distributions, which lead to infinite second-order moments for the stochastic (semi)gradients used in policy evaluation and direct policy optimization. In such instances, existing RL methods may fail because of frequent statistical outliers. In this work, we establish that temporal difference (TD) learning with a dynamic gradient clipping mechanism, and a natural actor-critic (NAC) method built on it, can be made provably robust to heavy-tailed reward distributions. Under linear function approximation, we show that this dynamic gradient clipping mechanism achieves a favorable tradeoff between the bias and the variability of the stochastic gradients. In particular, we prove that robust versions of TD learning achieve sample complexities of order $\mathcal{O}(\varepsilon^{-\frac{1}{p}})$ and $\mathcal{O}(\varepsilon^{-1-\frac{1}{p}})$ with and without the full-rank assumption on the feature matrix, respectively, under heavy-tailed rewards with finite moments of order $(1+p)$ for some $p\in(0,1]$, both in expectation and with high probability. We further show that a robust variant of NAC based on robust TD learning achieves $\tilde{\mathcal{O}}(\varepsilon^{-4-\frac{2}{p}})$ sample complexity. We corroborate our theoretical results with numerical experiments.
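To make the clipping idea concrete, the following is a minimal sketch of TD(0) with linear function approximation in which each stochastic semi-gradient is clipped in Euclidean norm before the update. It is an illustration under stated assumptions, not the algorithm analyzed in the paper: the function names `clip_by_norm` and `robust_td0`, the threshold schedule $b_t = b_0 (t+1)^{1/(1+p)}$, the constant step size, the iterate averaging, and the Lomax reward noise are all hypothetical choices made for the example.

```python
import numpy as np


def clip_by_norm(g, threshold):
    """Rescale g so that its Euclidean norm is at most `threshold`."""
    norm = np.linalg.norm(g)
    return g if norm <= threshold else g * (threshold / norm)


def robust_td0(P, Phi, mean_reward, gamma=0.9, p=0.4, step_size=0.05,
               b0=1.0, num_steps=5000, seed=0):
    """TD(0) with linear features and norm-clipped semi-gradients.

    P: (n, n) row-stochastic transition matrix of a Markov reward process.
    Phi: (n, d) feature matrix; the value estimate is Phi @ theta.
    mean_reward: (n,) expected reward in each state.
    """
    rng = np.random.default_rng(seed)
    n_states, dim = Phi.shape
    theta = np.zeros(dim)
    theta_avg = np.zeros(dim)
    s = 0
    for t in range(num_steps):
        s_next = rng.choice(n_states, p=P[s])
        # Assumed noise model: centered Lomax noise with shape 1.5
        # (mean 2 before centering, infinite variance).
        r = mean_reward[s] + rng.pareto(1.5) - 2.0
        td_error = r + gamma * Phi[s_next] @ theta - Phi[s] @ theta
        g = td_error * Phi[s]  # stochastic semi-gradient
        # Assumed schedule: the clipping threshold grows with t.
        b_t = b0 * (t + 1) ** (1.0 / (1.0 + p))
        theta = theta + step_size * clip_by_norm(g, b_t)
        theta_avg += (theta - theta_avg) / (t + 1)  # running average
        s = s_next
    return theta_avg


# Usage on a hypothetical 3-state chain with tabular (identity) features.
P = np.array([[0.6, 0.4, 0.0],
              [0.1, 0.6, 0.3],
              [0.3, 0.0, 0.7]])
theta_hat = robust_td0(P, np.eye(3), mean_reward=np.array([1.0, 0.0, -1.0]))
print(theta_hat)
```

A growing threshold is one way to trade bias for variability: early on, aggressive clipping suppresses outliers at the cost of bias, and the bias shrinks as the threshold increases. The specific growth rate used here is an assumption for illustration only.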