The Choice of Noninformative Priors for Thompson Sampling in Multiparameter Bandit Models

Thompson sampling (TS) is known for its outstanding empirical performance, supported by theoretical guarantees across various reward models in the classical stochastic multi-armed bandit problem. Nonetheless, its optimality has often been established only for specific priors, because TS is commonly observed to be fairly insensitive to the choice of prior as far as asymptotic regret bounds are concerned. When the model contains multiple parameters, however, the optimality of TS depends heavily on the choice of prior, which casts doubt on whether previous findings generalize to other models. To address this gap, this study explores the impact of the choice of noninformative prior, offering insight into the performance of TS on new models that lack theoretical understanding. We first extend the regret analysis of TS to the model of uniform distributions with unknown supports, arguably the simplest non-regular model. Our findings reveal that changing the noninformative prior can significantly affect the expected regret, in line with known results for other multiparameter bandit models. Although the uniform prior is shown to be optimal, we highlight an inherent limitation: its optimality holds only under specific parameterizations, which underscores the importance of the invariance property of priors. In light of this limitation, we propose a slightly modified TS-based policy, called TS with Truncation (TS-T), which achieves asymptotic optimality for the Gaussian models and the uniform models using the reference prior and the Jeffreys prior, both of which are invariant under one-to-one reparameterizations. This policy offers an alternative route to optimality through fine-tuned truncation, which would be much easier in practice than hunting for optimal priors.
