This paper presents two algorithms, AdaOFUL and VARA, for online sequential decision-making with heavy-tailed rewards that have only finite variances. For linear stochastic bandits, we handle heavy-tailed rewards by modifying adaptive Huber regression, which yields AdaOFUL. AdaOFUL achieves a state-of-the-art regret bound of $\widetilde{\mathcal{O}}\big(d\big(\sum_{t=1}^T \nu_{t}^2\big)^{1/2}+d\big)$, as if the rewards were uniformly bounded, where $\nu_{t}^2$ is the observed conditional variance of the reward at round $t$, $d$ is the feature dimension, and $\widetilde{\mathcal{O}}(\cdot)$ hides logarithmic dependence. Building upon AdaOFUL, we propose VARA for linear MDPs, which achieves a tighter variance-aware regret bound of $\widetilde{\mathcal{O}}(d\sqrt{H\mathcal{G}^*K})$. Here, $H$ is the episode length, $K$ is the number of episodes, and $\mathcal{G}^*$ is a smaller instance-dependent quantity that can be bounded by other instance-dependent quantities when the MDP satisfies additional structural conditions. Our regret bound improves on the current state-of-the-art bounds in three ways: (1) it depends on a tighter instance-dependent quantity and has optimal dependence on $d$ and $H$, (2) it admits further instance-dependent bounds on $\mathcal{G}^*$ under additional structural conditions on the MDP, and (3) it remains valid even when rewards have only finite variances, a level of generality unmatched by previous works. Overall, our modified adaptive Huber regression algorithm may serve as a useful building block in the design of algorithms for online problems with heavy-tailed rewards.
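To illustrate why a Huber-type loss helps under heavy-tailed noise, the following is a minimal sketch of ridge-regularized Huber regression for a scalar feature, solved by iteratively reweighted least squares. It is not the AdaOFUL procedure itself: the function names, the fixed threshold `tau`, the per-sample scales `sigmas`, and the Student-t noise model are all assumptions made for illustration only.

```python
import math
import random


def huber_loss(z, tau):
    """Huber loss: quadratic for |z| <= tau, linear beyond it."""
    a = abs(z)
    return 0.5 * z * z if a <= tau else tau * a - 0.5 * tau * tau


def huber_weight(z, tau):
    """IRLS weight w(z) = psi(z) / z, where psi clips z to [-tau, tau]."""
    a = abs(z)
    return 1.0 if a <= tau else tau / a


def fit_huber_1d(xs, ys, sigmas, tau=1.0, lam=1.0, iters=50):
    """Minimize lam/2 * theta^2 + sum_i huber_loss((y_i - theta*x_i)/sigma_i, tau).

    Hypothetical helper: each pass reweights the samples by huber_weight
    and solves the resulting weighted ridge problem in closed form.
    """
    theta = 0.0
    for _ in range(iters):
        num, den = 0.0, lam
        for x, y, s in zip(xs, ys, sigmas):
            z = (y - theta * x) / s                 # normalized residual
            w = huber_weight(z, tau) / (s * s)      # large residuals are down-weighted
            num += w * x * y
            den += w * x * x
        theta = num / den
    return theta


def student_t_noise(rng, df=3.0):
    """Heavy-tailed noise with finite variance (Student-t with df > 2)."""
    chi2 = rng.gammavariate(df / 2.0, 2.0)
    return rng.gauss(0.0, 1.0) / math.sqrt(chi2 / df)


if __name__ == "__main__":
    rng = random.Random(0)
    true_theta = 2.0  # assumed ground truth for the demo
    xs = [rng.uniform(-1.0, 1.0) for _ in range(2000)]
    ys = [true_theta * x + student_t_noise(rng) for x in xs]
    est = fit_huber_1d(xs, ys, [1.0] * len(xs), tau=1.5)
    print("Huber estimate of theta:", est)
```

In this sketch a single gross outlier contributes to the estimate only through its clipped residual, whereas it would shift an ordinary least-squares fit in proportion to its full size. An adaptive scheme would choose the threshold and the scales from the data at each round rather than fixing them in advance, as is done here for simplicity.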