Thompson sampling (TS) is widely used for stochastic multi-armed bandits, yet its inferential properties under adaptive data collection are subtle. Classical asymptotic theory for sample means can fail because arm-specific sample sizes are random and coupled with the rewards through the action-selection rule. We study this phenomenon in the $K$-armed Gaussian bandit and identify \emph{optimism} as a key mechanism for restoring \emph{stability}, a sufficient condition for valid asymptotic inference requiring each arm's pull count to concentrate around a deterministic scale. First, we prove that variance-inflated TS \citep{halder2025stable} is stable for any $K \ge 2$, including the challenging regime where multiple arms are optimal. This resolves the open question raised by \citet{halder2025stable} by extending their results from the two-armed setting to the general $K$-armed setting. Second, we analyze an alternative optimistic modification that keeps the posterior variance unchanged but adds an explicit mean bonus to the posterior mean, and we establish the same stability conclusion. In summary, suitably implemented optimism stabilizes Thompson sampling and enables asymptotically valid inference in multi-armed bandits, while incurring only a mild additional regret cost.
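Schematically, the two optimistic variants can be contrasted as follows. Let $\hat\mu_k(t)$ and $N_k(t)$ denote the sample mean and pull count of arm $k$ at round $t$, and let $\sigma^2$ be the (known) reward variance; the inflation factor $\rho > 1$ and bonus term $b_k(t) \ge 0$ below are illustrative parameters, not the specific choices analyzed in this work:
\begin{align*}
\theta_k(t) &\sim \mathcal{N}\!\left(\hat\mu_k(t),\; \rho \, \frac{\sigma^2}{N_k(t)}\right) && \text{(variance-inflated TS),}\\
\theta_k(t) &\sim \mathcal{N}\!\left(\hat\mu_k(t) + b_k(t),\; \frac{\sigma^2}{N_k(t)}\right) && \text{(mean-bonus TS),}
\end{align*}
with the arm $A_t = \arg\max_k \theta_k(t)$ pulled in round $t$. Both rules inject optimism, through a wider posterior in the first case and a shifted posterior mean in the second.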