Variance-Dependent Regret Bounds for Linear Bandits and Reinforcement Learning: Adaptivity and Computational Efficiency

Recently, several studies (Zhou et al., 2021a; Zhang et al., 2021b; Kim et al., 2021; Zhou and Gu, 2022) have provided variance-dependent regret bounds for linear contextual bandits, which interpolate between the regret of the worst-case regime and that of the deterministic reward regime. However, these algorithms are either computationally intractable or unable to handle unknown noise variance. In this paper, we present a solution to this open problem by proposing the first computationally efficient algorithm for linear bandits with heteroscedastic noise. Our algorithm is adaptive to the unknown variance of the noise and achieves an $\tilde{O}(d \sqrt{\sum_{k = 1}^K \sigma_k^2} + d)$ regret, where $\sigma_k^2$ is the variance of the noise at round $k$, $d$ is the dimension of the contexts, and $K$ is the total number of rounds. Our results are based on an adaptive variance-aware confidence set, enabled by a new Freedman-type concentration inequality for self-normalized martingales, and on a multi-layer structure that stratifies the context vectors into layers with different uniform upper bounds on the uncertainty. Furthermore, our approach can be extended to linear mixture Markov decision processes (MDPs) in reinforcement learning. We propose a variance-adaptive algorithm for linear mixture MDPs, which achieves a problem-dependent horizon-free regret bound that gracefully reduces to a nearly constant regret for deterministic MDPs. Unlike existing nearly minimax optimal algorithms for linear mixture MDPs, our algorithm requires neither explicit variance estimation of the transition probabilities nor high-order moment estimators to attain horizon-free regret. We believe the techniques developed in this paper can have independent value for general online decision-making problems.
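
To make the interpolation concrete, the stated bound can be evaluated in two special cases. This is a worked illustration rather than a result quoted from the paper, and the uniform bound $\sigma_k \le \sigma$ is an assumption introduced here:

$$
\sigma_k \le \sigma \ \text{for all } k \;\Longrightarrow\; d \sqrt{\sum_{k = 1}^K \sigma_k^2} + d \;\le\; d \sigma \sqrt{K} + d,
\qquad
\sigma_k = 0 \ \text{for all } k \;\Longrightarrow\; d \sqrt{\sum_{k = 1}^K \sigma_k^2} + d \;=\; d.
$$

With $\sigma$ a constant, the first case matches the familiar $\tilde{O}(d \sqrt{K})$ worst-case rate. In the second case, with noiseless rewards, the bound is $\tilde{O}(d)$, so its only dependence on $K$ is through whatever logarithmic factors the $\tilde{O}(\cdot)$ notation hides.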
