In the stochastic multi-armed bandit problem, a randomized probability matching policy called Thompson sampling (TS) has shown excellent performance in various reward models. Beyond its empirical performance, TS has been shown to achieve asymptotic problem-dependent lower bounds in several models. However, its optimality has mainly been addressed under light-tailed or one-parameter models that belong to exponential families. In this paper, we consider the optimality of TS for the Pareto model, which has a heavy tail and is parameterized by two unknown parameters. Specifically, we discuss the optimality of TS with probability matching priors that include the Jeffreys prior and the reference priors. We first prove that TS with certain probability matching priors can achieve the optimal regret bound. Then, we show the suboptimality of TS with other priors, including the Jeffreys and the reference priors. Nevertheless, we find that TS with the Jeffreys and reference priors can achieve the asymptotic lower bound if one uses a truncation procedure. These results suggest that noninformative priors should be chosen carefully to avoid suboptimality, and they show the effectiveness of truncation procedures in TS-based policies.