Adapting to an a priori unknown noise level is an important but challenging problem in sequential decision-making: efficient exploration typically requires knowledge of the noise level, which is often only loosely specified. We report significant progress on this problem for linear bandits in two respects. First, we propose a novel confidence set that is `semi-adaptive' to the unknown sub-Gaussian parameter $\sigma_*^2$, in the sense that the (normalized) confidence width scales with $\sqrt{d\sigma_*^2 + \sigma_0^2}$, where $d$ is the dimension and $\sigma_0^2$ is the specified (known) sub-Gaussian parameter, which can be much larger than $\sigma_*^2$. This is a significant improvement over the $\sqrt{d\sigma_0^2}$ scaling of the standard confidence set of Abbasi-Yadkori et al. (2011), especially when $d$ is large or $\sigma_*^2=0$. We show that this leads to an improved regret bound in linear bandits. Second, for bounded rewards, we propose a novel variance-adaptive confidence set with much better numerical performance than prior art. We then apply this confidence set to develop, as we claim, the first practical variance-adaptive linear bandit algorithm via an optimistic approach, which is enabled by our novel regret analysis technique. Both of our confidence sets rely critically on `regret equality' from online learning. Our empirical evaluation on diverse Bayesian optimization tasks shows that our proposed algorithms perform better than or comparably to existing methods.
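To make the comparison of the two width scalings concrete, the following is a minimal sketch that evaluates $\sqrt{d\sigma_0^2}$ and $\sqrt{d\sigma_*^2 + \sigma_0^2}$ for given values. It assumes that constants and logarithmic factors are ignored, so only the stated scaling is compared; the function names are hypothetical and are not taken from the paper.

```python
import math


def standard_width_scale(d: int, sigma0_sq: float) -> float:
    """Scaling sqrt(d * sigma_0^2) quoted for the standard confidence set.

    Constants and logarithmic factors are deliberately ignored (assumption).
    """
    return math.sqrt(d * sigma0_sq)


def semi_adaptive_width_scale(d: int, sigma_star_sq: float, sigma0_sq: float) -> float:
    """Scaling sqrt(d * sigma_*^2 + sigma_0^2) quoted for the semi-adaptive set.

    Constants and logarithmic factors are deliberately ignored (assumption).
    """
    return math.sqrt(d * sigma_star_sq + sigma0_sq)


def improvement_ratio(d: int, sigma_star_sq: float, sigma0_sq: float) -> float:
    """Ratio of the standard scaling to the semi-adaptive scaling.

    A value above 1 means the semi-adaptive scaling is the smaller of the two.
    """
    return standard_width_scale(d, sigma0_sq) / semi_adaptive_width_scale(
        d, sigma_star_sq, sigma0_sq
    )


if __name__ == "__main__":
    # Noiseless case (sigma_*^2 = 0) with a loosely specified sigma_0^2 = 1.
    for d in (4, 25, 100):
        print(d, improvement_ratio(d, sigma_star_sq=0.0, sigma0_sq=1.0))
```

In the noiseless case the ratio equals $\sqrt{d}$, which illustrates why the gain is largest when $d$ is large or $\sigma_*^2 = 0$. When $\sigma_*^2 = \sigma_0^2$, the ratio is $\sqrt{d/(d+1)}$, slightly below 1, so under this simplified comparison the two scalings are nearly identical.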