We study the Bayesian regret of the renowned Thompson Sampling algorithm in contextual bandits with binary losses and adversarially selected contexts. We adapt the information-theoretic perspective of \cite{RvR16} to the contextual setting by considering a lifted version of the information ratio, defined in terms of the unknown model parameter rather than the optimal action or optimal policy used in previous work on the same setting. This allows us to bound the regret in terms of the entropy of the prior distribution through a remarkably simple proof, with no structural assumptions on the likelihood or the prior. The extension to priors with infinite entropy requires only a Lipschitz assumption on the log-likelihood. An interesting special case is that of logistic bandits with $d$-dimensional parameters, $K$ actions, and Lipschitz logits, for which we provide a $\widetilde{O}(\sqrt{dKT})$ regret upper bound that does not depend on the smallest slope of the sigmoid link function.
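To make the setting concrete, the following is a minimal sketch of Thompson Sampling for a logistic contextual bandit with binary losses. It is an illustration under stated assumptions, not the construction analyzed in the paper: the prior is assumed to have finite support (so its entropy is finite and the posterior can be updated exactly), the contexts are drawn at random rather than chosen by an adversary, and all names (`thompson_sampling`, `loss_prob`, `entropy`, the problem sizes) are hypothetical.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def loss_prob(theta, features):
    """Probability of observing loss 1 for an action with this feature vector."""
    return sigmoid(sum(t * f for t, f in zip(theta, features)))


def entropy(p):
    """Shannon entropy (in nats) of a finite distribution."""
    return -sum(q * math.log(q) for q in p if q > 0)


def thompson_sampling(thetas, prior, contexts, true_idx, rng):
    """Run Thompson Sampling over a finite parameter set (an assumption).

    contexts[t][a] is the feature vector of action a in round t.
    Returns the cumulative loss and the final posterior over thetas.
    """
    posterior = list(prior)
    total_loss = 0
    for context in contexts:
        # 1. Sample a parameter from the current posterior.
        sampled = rng.choices(range(len(thetas)), weights=posterior)[0]
        # 2. Play the action with the smallest expected loss under the sample.
        action = min(
            range(len(context)),
            key=lambda a: loss_prob(thetas[sampled], context[a]),
        )
        # 3. Observe a binary loss generated by the true parameter.
        p_true = loss_prob(thetas[true_idx], context[action])
        loss = 1 if rng.random() < p_true else 0
        total_loss += loss
        # 4. Exact Bayes update with the Bernoulli likelihood.
        weights = []
        for p, theta in zip(posterior, thetas):
            q = loss_prob(theta, context[action])
            weights.append(p * (q if loss == 1 else 1.0 - q))
        norm = sum(weights)
        posterior = [w / norm for w in weights]
    return total_loss, posterior


# Hypothetical problem sizes: d-dimensional parameters, K actions, T rounds.
rng = random.Random(0)
d, K, T = 2, 3, 200
thetas = [[rng.uniform(-1, 1) for _ in range(d)] for _ in range(8)]
prior = [1.0 / len(thetas)] * len(thetas)
# Bayesian setting: the true parameter is itself drawn from the prior.
true_idx = rng.choices(range(len(thetas)), weights=prior)[0]
contexts = [
    [[rng.uniform(-1, 1) for _ in range(d)] for _ in range(K)]
    for _ in range(T)
]
total_loss, posterior = thompson_sampling(thetas, prior, contexts, true_idx, rng)
print("cumulative loss:", total_loss)
print("prior entropy (nats):", entropy(prior))
```

The prior entropy printed at the end is the quantity in terms of which the regret is bounded in the finite-entropy case; the sketch only computes it and does not evaluate the bound itself.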