In online learning problems, exploiting low variance plays an important role in obtaining tight performance guarantees, yet it is challenging because variances are often not known a priori. Recently, Zhang et al. (2021) made considerable progress by obtaining a variance-adaptive regret bound for linear bandits without knowledge of the variances and a horizon-free regret bound for linear mixture Markov decision processes (MDPs). In this paper, we present novel analyses that improve their regret bounds significantly. For linear bandits, we achieve $\tilde O(\min\{d\sqrt{K}, d^{1.5}\sqrt{\sum_{k=1}^K \sigma_k^2}\} + d^2)$, where $d$ is the dimension of the features, $K$ is the time horizon, $\sigma_k^2$ is the noise variance at time step $k$, and $\tilde O$ ignores polylogarithmic dependence; this is a factor of $d^3$ improvement. For linear mixture MDPs, under the assumption that the maximum cumulative reward in an episode is in $[0,1]$, we achieve a horizon-free regret bound of $\tilde O(d \sqrt{K} + d^2)$, where $d$ is the number of base models and $K$ is the number of episodes. This is a factor of $d^{3.5}$ improvement in the leading term and $d^7$ in the lower-order term. Our analysis critically relies on a novel peeling-based regret analysis that leverages the elliptical potential "count" lemma.
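To illustrate how the two terms inside the minimum of the linear bandit bound trade off, consider the special case of a constant noise variance, $\sigma_k^2 = \sigma^2$ for all $k$. This case is an assumption made here purely for illustration, and constants and polylogarithmic factors are ignored throughout:

$$
d^{1.5}\sqrt{\sum_{k=1}^K \sigma_k^2} \;=\; d^{1.5}\,\sigma\sqrt{K},
\qquad
d^{1.5}\,\sigma\sqrt{K} \;\le\; d\sqrt{K}
\;\iff\;
\sigma \;\le\; d^{-1/2}.
$$

Under this simplification, the variance-adaptive term is the smaller of the two once $\sigma \le d^{-1/2}$. In the noiseless extreme $\sigma = 0$, the minimum vanishes and the bound reduces to $\tilde O(d^2)$, which does not grow polynomially with $K$.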