Tackling Heavy-Tailed Rewards in Reinforcement Learning with Function Approximation: Minimax Optimal and Instance-Dependent Regret Bounds

While numerous works have focused on devising efficient algorithms for reinforcement learning (RL) with uniformly bounded rewards, it remains an open question whether sample- or time-efficient algorithms exist for RL with large state-action spaces when the rewards are \emph{heavy-tailed}, i.e., have only finite $(1+\epsilon)$-th moments for some $\epsilon\in(0,1]$. In this work, we address the challenge of such rewards in RL with linear function approximation. We first design an algorithm, \textsc{Heavy-OFUL}, for heavy-tailed linear bandits, achieving an \emph{instance-dependent} $T$-round regret of $\tilde{O}\big(d T^{\frac{1-\epsilon}{2(1+\epsilon)}} \sqrt{\sum_{t=1}^T \nu_t^2} + d T^{\frac{1-\epsilon}{2(1+\epsilon)}}\big)$, the \emph{first} of its kind. Here, $d$ is the feature dimension, and $\nu_t^{1+\epsilon}$ is the $(1+\epsilon)$-th central moment of the reward at the $t$-th round. We further show that this bound is minimax optimal when applied to the worst-case instances in stochastic and deterministic linear bandits. We then extend this algorithm to the RL setting with linear function approximation. Our algorithm, termed \textsc{Heavy-LSVI-UCB}, achieves the \emph{first} computationally efficient \emph{instance-dependent} $K$-episode regret of $\tilde{O}\big(d \sqrt{H \mathcal{U}^*} K^{\frac{1}{1+\epsilon}} + d \sqrt{H \mathcal{V}^* K}\big)$. Here, $H$ is the length of an episode, and $\mathcal{U}^*$ and $\mathcal{V}^*$ are instance-dependent quantities scaling with the central moments of the reward and value functions, respectively. We also provide a matching minimax lower bound of $\Omega\big(d H K^{\frac{1}{1+\epsilon}} + d \sqrt{H^3 K}\big)$ to demonstrate the optimality of our algorithm in the worst case. Our result is achieved via a novel robust self-normalized concentration inequality that may be of independent interest for handling heavy-tailed noise in general online regression problems.
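
As an illustrative consistency check on the \textsc{Heavy-OFUL} rate, consider the added assumption that the moment parameters are uniformly bounded, $\nu_t \le \nu$ for all $t$, where the constant $\nu$ is hypothetical and introduced only for this sketch. Then $\sqrt{\sum_{t=1}^T \nu_t^2} \le \nu \sqrt{T}$, and the first term of the bound becomes
\[
d\, T^{\frac{1-\epsilon}{2(1+\epsilon)}} \sqrt{\sum_{t=1}^T \nu_t^2}
\;\le\; d\, \nu\, T^{\frac{1-\epsilon}{2(1+\epsilon)} + \frac{1}{2}}
\;=\; d\, \nu\, T^{\frac{1}{1+\epsilon}},
\qquad\text{since}\quad
\frac{1-\epsilon}{2(1+\epsilon)} + \frac{1+\epsilon}{2(1+\epsilon)} = \frac{1}{1+\epsilon}.
\]
In the noiseless case $\nu_t = 0$ for all $t$, only the second term, $d\, T^{\frac{1-\epsilon}{2(1+\epsilon)}}$, remains. At $\epsilon = 1$ (finite variance) the prefactor $T^{\frac{1-\epsilon}{2(1+\epsilon)}}$ equals $1$, so the stated bound reads $\tilde{O}\big(d \sqrt{\sum_{t=1}^T \nu_t^2} + d\big)$.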
