We use the lens of weak signal asymptotics to study a class of sequentially randomized experiments, including those that arise in solving multi-armed bandit problems. In an experiment with $n$ time steps, we let the mean reward gaps between actions scale as $1/\sqrt{n}$, which preserves the difficulty of the learning task as $n$ grows. In this regime, we show that the sample paths of a class of sequentially randomized experiments (those adapted to this scaling and whose arm selection probabilities vary continuously with the state) converge weakly to a diffusion limit, given as the solution to a stochastic differential equation. The diffusion limit enables us to derive a refined, instance-specific characterization of the stochastic dynamics, and to obtain several insights into the regret and belief evolution of a number of sequential experiments, including Thompson sampling (but not UCB, which does not satisfy our continuity assumption). We show that all sequential experiments whose randomization probabilities depend Lipschitz-continuously on the observed data suffer from sub-optimal regret when the reward gaps are relatively large. Conversely, we find that a version of Thompson sampling with an asymptotically uninformative prior variance achieves near-optimal instance-specific regret scaling, including with large reward gaps, but that these good regret properties come at the cost of highly unstable posterior beliefs.
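To make the scaling regime concrete, the following is a minimal simulation sketch of Thompson sampling when the arm means are set to $\delta_k/\sqrt{n}$. It is not the construction used in the paper: the Gaussian unit-variance rewards, the independent $N(0, \texttt{prior\_var})$ priors, and the names `thompson_scaled_regret`, `deltas`, and `prior_var` are all assumptions made for illustration. The function returns the cumulative regret divided by $\sqrt{n}$, which is the quantity that stays of order one under this scaling.

```python
import numpy as np


def thompson_scaled_regret(n, deltas, prior_var=1.0, seed=0):
    """Run Gaussian Thompson sampling for n steps with arm means deltas / sqrt(n).

    Illustrative assumptions (not taken from the source): rewards are
    N(mean, 1) and each arm mean has an independent N(0, prior_var) prior.
    Returns cumulative regret divided by sqrt(n).
    """
    rng = np.random.default_rng(seed)
    deltas = np.asarray(deltas, dtype=float)
    means = deltas / np.sqrt(n)      # weak-signal scaling of the arm means
    k = len(means)
    counts = np.zeros(k)             # number of pulls per arm
    sums = np.zeros(k)               # sum of observed rewards per arm
    regret = 0.0
    for _ in range(n):
        # Conjugate normal posterior for each arm mean (unit noise variance).
        precision = 1.0 / prior_var + counts
        post_mean = sums / precision
        post_sd = 1.0 / np.sqrt(precision)
        # Draw one posterior sample per arm and play the arm with the largest draw.
        draws = rng.normal(post_mean, post_sd)
        arm = int(np.argmax(draws))
        reward = rng.normal(means[arm], 1.0)
        counts[arm] += 1
        sums[arm] += reward
        regret += means.max() - means[arm]
    return regret / np.sqrt(n)
```

For example, `thompson_scaled_regret(10_000, [0.0, 2.0], prior_var=100.0)` simulates a two-armed instance with scaled gap 2 and a diffuse prior. Because each step loses at most the gap $\delta/\sqrt{n}$, the returned value always lies between 0 and the largest scaled gap, so runs with different $n$ can be compared on a common scale.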