We prove that Thompson sampling exhibits $\tilde{O}(\sigma d \sqrt{T} + d r \sqrt{\mathrm{Tr}(\Sigma_0)})$ Bayesian regret in the linear-Gaussian bandit with a $\mathcal{N}(\mu_0, \Sigma_0)$ prior distribution on the coefficients, where $d$ is the dimension, $T$ is the time horizon, $r$ is the maximum $\ell_2$ norm of the actions, and $\sigma^2$ is the noise variance. This shows that, up to logarithmic factors, the prior-dependent ``burn-in'' term $d r \sqrt{\mathrm{Tr}(\Sigma_0)}$ decouples additively from the minimax (long-run) regret $\sigma d \sqrt{T}$; in contrast, previous regret bounds exhibit a multiplicative interaction between these terms. We establish these results via a new ``elliptical potential'' lemma, and also provide a lower bound indicating that the burn-in term is unavoidable.
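For concreteness, one round of Thompson sampling in this conjugate linear-Gaussian setting follows the standard Gaussian posterior-update recursion sketched below; the symbols $\tilde\theta_t$, $A_t$, $Y_t$, $\varepsilon_t$, and the action set $\mathcal{A}$ are notation introduced here for illustration, not taken from the statement above:
\[
\tilde\theta_t \sim \mathcal{N}(\mu_{t-1}, \Sigma_{t-1}), \qquad
A_t = \operatorname*{arg\,max}_{a \in \mathcal{A}} \, a^\top \tilde\theta_t, \qquad
Y_t = A_t^\top \theta + \varepsilon_t, \quad \varepsilon_t \sim \mathcal{N}(0, \sigma^2),
\]
\[
\Sigma_t^{-1} = \Sigma_{t-1}^{-1} + \sigma^{-2} A_t A_t^\top, \qquad
\mu_t = \Sigma_t \left( \Sigma_{t-1}^{-1} \mu_{t-1} + \sigma^{-2} A_t Y_t \right),
\]
where $(\mu_0, \Sigma_0)$ are the prior parameters and $\|a\|_2 \le r$ for every $a \in \mathcal{A}$.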