Thompson sampling: Precise arm-pull dynamics and adaptive inference

Adaptive sampling schemes are well known to create complex dependence that may invalidate conventional inference methods. A recent line of work shows that this need not be the case for UCB-type algorithms in multi-armed bandits. A central emerging theme is a `stability' property: the arm-pull counts of these algorithms are asymptotically deterministic, which makes inference as easy as in the i.i.d. setting. In this paper, we study the precise arm-pull dynamics for another canonical class, namely Thompson-sampling type algorithms. We show that the phenomenology is qualitatively different: the arm-pull count is asymptotically deterministic if and only if the arm is suboptimal or is the unique optimal arm; otherwise it converges in distribution to the unique invariant law of an SDE. This dichotomy uncovers a unifying principle behind many existing (in)stability results: an arm is stable if and only if its interaction with statistical noise is asymptotically negligible. As an application, we show that normalized arm means obey the same dichotomy, with Gaussian limits for stable arms and a semi-universal, non-Gaussian limit for unstable arms. This not only enables the construction of confidence intervals for the unknown mean rewards despite non-normality, but also reveals the potential for developing tractable inference procedures beyond the stable regime. The proofs rely on two new approaches. For suboptimal arms, we develop an `inverse process' approach that characterizes the inverse of the arm-pull count process via a Stieltjes integral. For optimal arms, we adopt a reparametrization of the arm-pull and noise processes that reduces the difficulty posed by the singularity in the natural SDE to proving the uniqueness of the invariant law of another SDE. We prove the latter using a set of analytic tools, including the parabolic Hörmander condition and the Stroock-Varadhan support theorem.
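The stable/unstable dichotomy stated above can be probed numerically. The following is a minimal simulation sketch, not the algorithm or setup analyzed in the paper: it assumes unit-variance Gaussian rewards, a flat-prior Gaussian Thompson sampler whose posterior draw for each arm is N(empirical mean, 1/count), and one initial pull per arm. The function name `thompson_pull_fractions` and all parameter values are hypothetical choices made for illustration.

```python
import numpy as np


def thompson_pull_fractions(means, horizon, n_reps, seed=0):
    """Simulate n_reps independent runs of Gaussian Thompson sampling.

    Assumptions (illustrative only): unit-variance Gaussian rewards and a
    posterior draw N(empirical mean, 1 / count) for each arm.
    Returns an array of shape (n_reps, n_arms) holding the fraction of
    pulls each arm received in each run.
    """
    rng = np.random.default_rng(seed)
    means = np.asarray(means, dtype=float)
    n_arms = means.size
    rows = np.arange(n_reps)

    # Initialise by pulling every arm once in every run.
    counts = np.ones((n_reps, n_arms))
    sums = means + rng.standard_normal((n_reps, n_arms))

    for _ in range(horizon - n_arms):
        # One posterior sample per arm and per run.
        noise = rng.standard_normal((n_reps, n_arms))
        draws = sums / counts + noise / np.sqrt(counts)
        # Pull the arm with the largest posterior sample.
        arm = draws.argmax(axis=1)
        reward = means[arm] + rng.standard_normal(n_reps)
        counts[rows, arm] += 1
        sums[rows, arm] += reward

    return counts / horizon


# Two optimal arms (tied means) versus a unique optimal arm.
tied = thompson_pull_fractions([0.0, 0.0], horizon=2000, n_reps=200, seed=0)
gap = thompson_pull_fractions([1.0, 0.0], horizon=2000, n_reps=200, seed=1)

print("tied means, std of arm-0 pull fraction:", tied[:, 0].std())
print("unique optimum, std of arm-0 pull fraction:", gap[:, 0].std())
```

Under these assumptions, the pull fraction of the first arm should stay visibly spread out across runs when the two means are tied, and should concentrate near one when that arm is the unique optimum. The sketch only illustrates the qualitative contrast; it does not reproduce the limiting law or the normalization used in the paper.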
