Asymptotically Optimal Thompson Sampling Based Policy for the Uniform Bandits and the Gaussian Bandits

Thompson sampling (TS) for parametric stochastic multi-armed bandits has been well studied under one-dimensional parametric models. It is often reported that the regret bounds of TS are fairly insensitive to the choice of prior. However, this property does not necessarily hold for multiparameter models, e.g., a Gaussian model with unknown mean and variance. In this paper, we first extend the regret analysis of TS to the model of uniform distributions with unknown supports. Specifically, we show that switching between noninformative priors drastically affects the expected regret. Our analysis proves that the uniform prior is the optimal choice in terms of the expected regret, while the reference prior and the Jeffreys prior are suboptimal, which is consistent with previous findings for the model of Gaussian distributions. However, the uniform prior is specific to the parameterization of the distributions: if an agent considers a different parameterization of the same model, the agent with the uniform prior might not always achieve the optimal performance. In light of this limitation, we propose a slightly modified TS-based policy, called TS with Truncation (TS-T), which achieves asymptotic optimality for the Gaussian distributions and the uniform distributions while using the reference prior and the Jeffreys prior, both of which are invariant under one-to-one reparameterizations. The key to TS-T is a pre-processing of the posterior distribution, in which we add an adaptive truncation procedure on the parameter space of the posterior distributions. Simulation results support our analysis: TS-T shows the best finite-horizon performance compared to other known optimal policies, while TS with the invariant priors performs poorly.
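To make the role of the prior concrete, the following is a minimal sketch of plain TS (not TS-T) for Gaussian arms with unknown mean and variance. It assumes a prior family of the form p(mu, sigma^2) proportional to (sigma^2)^(-1 - alpha), under which the uniform, reference, and Jeffreys priors would correspond to alpha = -1, 0, and 1/2 respectively, and the marginal posterior of the mean is a scaled Student-t distribution. The function names, the alpha parameterization, and the initial-pull rule are illustrative assumptions rather than the paper's notation, and the adaptive truncation step of TS-T is not reproduced because the abstract does not specify it.

```python
import math

import numpy as np


def sample_posterior_mean(rewards, alpha, rng):
    """Draw one sample of the mean from its marginal posterior.

    Assumes Gaussian rewards and the (hypothetical) prior family
    p(mu, sigma^2) proportional to (sigma^2)^(-1 - alpha). The marginal
    posterior of mu is then a Student-t with n + 2*alpha - 1 degrees of
    freedom, centred at the sample mean.
    """
    x = np.asarray(rewards, dtype=float)
    n = x.size
    dof = n + 2 * alpha - 1
    if dof <= 0:
        raise ValueError("need n + 2*alpha - 1 > 0 for a proper posterior")
    mean = x.mean()
    ss = np.sum((x - mean) ** 2)       # sum of squared deviations
    scale = np.sqrt(ss / (n * dof))    # scale of the t posterior
    return mean + scale * rng.standard_t(dof)


def run_thompson_sampling(means, sigmas, horizon, alpha, seed=0):
    """Simulate plain TS on Gaussian arms; return pull counts and regret."""
    rng = np.random.default_rng(seed)
    k = len(means)
    # Pull each arm until its posterior is proper (and at least twice,
    # so that the sum of squared deviations is nonzero).
    n_init = max(2, math.floor(1 - 2 * alpha) + 1)
    history = [[] for _ in range(k)]
    pulls = np.zeros(k, dtype=int)
    for t in range(horizon):
        if t < k * n_init:
            arm = t % k                # round-robin initial exploration
        else:
            samples = [sample_posterior_mean(history[a], alpha, rng)
                       for a in range(k)]
            arm = int(np.argmax(samples))
        history[arm].append(rng.normal(means[arm], sigmas[arm]))
        pulls[arm] += 1
    gaps = max(means) - np.asarray(means, dtype=float)
    regret = float(np.sum(gaps * pulls))  # pseudo-regret from true means
    return pulls, regret


if __name__ == "__main__":
    for alpha in (-1.0, 0.0, 0.5):
        pulls, regret = run_thompson_sampling(
            means=[0.0, 1.0], sigmas=[1.0, 1.0], horizon=2000, alpha=alpha)
        print(f"alpha={alpha:+.1f} pulls={pulls.tolist()} regret={regret:.1f}")
```

A single short run like this only illustrates the mechanics. Comparing priors in terms of expected regret, as the analysis does, would require averaging over many independent seeds and longer horizons.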
