Variance-aware robust reinforcement learning with linear function approximation under heavy-tailed rewards

This paper presents two algorithms, AdaOFUL and VARA, for online sequential decision-making in the presence of heavy-tailed rewards with only finite variances. For linear stochastic bandits, we address the issue of heavy-tailed rewards by modifying the adaptive Huber regression and proposing AdaOFUL. AdaOFUL achieves a state-of-the-art regret bound of $\widetilde{O}\big(d\big(\sum_{t=1}^T \nu_{t}^2\big)^{1/2}+d\big)$ as if the rewards were uniformly bounded, where $\nu_{t}^2$ is the observed conditional variance of the reward at round $t$, $d$ is the feature dimension, and $\widetilde{O}(\cdot)$ hides logarithmic dependence. Building upon AdaOFUL, we propose VARA for linear MDPs, which achieves a tighter variance-aware regret bound of $\widetilde{O}(d\sqrt{HG^*K})$. Here, $H$ is the length of episodes, $K$ is the number of episodes, and $G^*$ is a smaller instance-dependent quantity that can be bounded by other instance-dependent quantities when additional structural conditions on the MDP are satisfied. Our regret bound is superior to the current state-of-the-art bounds in three ways: (1) it depends on a tighter instance-dependent quantity and has optimal dependence on $d$ and $H$, (2) we can obtain further instance-dependent bounds of $G^*$ under additional structural conditions on the MDP, and (3) our regret bound is valid even when rewards have only finite variances, achieving a level of generality unmatched by previous works. Overall, our modified adaptive Huber regression algorithm may serve as a useful building block in the design of algorithms for online problems with heavy-tailed rewards.
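To make the Huber-regression building block concrete, the following is a minimal sketch of a ridge-regularized Huber estimator fitted by iteratively reweighted least squares. It is not the modified adaptive Huber regression used by AdaOFUL: it uses a single fixed threshold `tau` and does not incorporate per-round variance information. The function names `huber_weights` and `huber_ridge`, and the parameters `tau`, `lam`, and `n_iter`, are assumptions chosen for illustration only.

```python
import numpy as np


def huber_weights(residuals, tau):
    """IRLS weights for the Huber loss: 1 if |r| <= tau, else tau / |r|."""
    abs_r = np.abs(residuals)
    # np.maximum guards against division by zero for exactly-zero residuals.
    return np.where(abs_r <= tau, 1.0, tau / np.maximum(abs_r, 1e-12))


def huber_ridge(X, y, tau=1.0, lam=1.0, n_iter=50):
    """Approximately minimize
        sum_i huber_tau(y_i - x_i^T theta) + (lam / 2) * ||theta||^2
    by iteratively reweighted least squares (illustrative sketch only).
    """
    n, d = X.shape
    theta = np.zeros(d)
    for _ in range(n_iter):
        # Large residuals get weight tau/|r|, which caps their influence.
        w = huber_weights(y - X @ theta, tau)
        # Solve the weighted ridge normal equations for the current weights.
        A = X.T @ (w[:, None] * X) + lam * np.eye(d)
        theta = np.linalg.solve(A, X.T @ (w * y))
    return theta
```

Because each residual's pull on the estimate is clipped at `tau`, a single very large reward moves the estimate far less than it would move an ordinary least-squares fit. This is the property that makes Huber-type losses a natural starting point when rewards are heavy-tailed with only finite variance.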
