We study stochastic linear bandits with parameter noise, in which the reward of action $a$ is $a^\top \theta$, where $\theta$ is sampled i.i.d. at each round. We show a regret upper bound of $\widetilde{O}\big(\sqrt{d T \log(K/\delta)\, \sigma^2_{\max}}\big)$ for a horizon $T$ and a general action set of size $K$ in dimension $d$, where $\sigma^2_{\max}$ is the maximal variance of the reward over all actions. We further provide a lower bound of $\widetilde{\Omega}\big(d \sqrt{T \sigma^2_{\max}}\big)$, which is tight (up to logarithmic factors) whenever $\log(K) \approx d$. For more specific action sets, namely $\ell_p$ unit balls with $p \leq 2$ and dual norm $q$, we show that the minimax regret is $\widetilde{\Theta}\big(\sqrt{d T \sigma^2_q}\big)$, where $\sigma^2_q$ is a variance-dependent quantity that is always at most $4$. This is in contrast to the minimax regret attainable for such sets in the classic additive noise model, which is of order $d \sqrt{T}$. Surprisingly, we show that this optimal (up to logarithmic factors) regret bound is attainable by a very simple explore-exploit algorithm.
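The abstract does not spell out the "very simple explore-exploit algorithm." As an illustration of the parameter-noise setting only, here is a generic explore-then-commit sketch for a finite action set; the problem instance, the per-arm budget `m`, and all other names are hypothetical, not the paper's construction:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative instance: K unit-norm actions in R^d. The reward of action a at
# each round is a @ theta_t, with theta_t drawn i.i.d. from N(mu, Sigma)
# (parameter noise, rather than additive noise on the reward).
d, K, T = 5, 20, 10_000
actions = rng.normal(size=(K, d))
actions /= np.linalg.norm(actions, axis=1, keepdims=True)
mu = rng.normal(size=d)          # mean parameter (unknown to the learner)
Sigma = 0.1 * np.eye(d)          # parameter-noise covariance (assumed)

def pull(a):
    """Observe the reward of action a under a fresh parameter sample."""
    theta_t = rng.multivariate_normal(mu, Sigma)
    return float(a @ theta_t)

# Explore: play each action m times and record its empirical mean reward.
m = 50
means = np.array([np.mean([pull(a) for _ in range(m)]) for a in actions])

# Exploit: commit to the empirically best action for the remaining rounds.
best = int(np.argmax(means))
total = sum(pull(actions[best]) for _ in range(T - m * K))

opt_mean = float(np.max(actions @ mu))
gap = opt_mean - float(actions[best] @ mu)
print(f"suboptimality gap of committed arm: {gap:.4f}")
```

The point of the sketch is that, with parameter noise, per-arm mean estimation over a finite set already suffices for an explore-then-commit strategy; tuning `m` against the variance is what drives regret bounds of this flavor.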