We use the lens of weak signal asymptotics to study a class of sequentially randomized experiments, including those that arise in solving multi-armed bandit problems. In an experiment with $n$ time steps, we let the mean reward gaps between actions scale to the order $1/\sqrt{n}$ so as to preserve the difficulty of the learning task as $n$ grows. In this regime, we show that the sample paths of a class of sequentially randomized experiments (those adapted to this scaling regime and with arm selection probabilities that vary continuously with state) converge weakly to a diffusion limit, given as the solution to a stochastic differential equation. The diffusion limit enables us to derive a refined, instance-specific characterization of the stochastic dynamics, and to obtain several insights into the regret and belief evolution of a number of sequential experiments, including Thompson sampling (but not UCB, which does not satisfy our continuity assumption). We show that all sequential experiments whose randomization probabilities have a Lipschitz-continuous dependence on the observed data suffer from suboptimal regret performance when the reward gaps are relatively large. Conversely, we find that a version of Thompson sampling with an asymptotically uninformative prior variance achieves near-optimal instance-specific regret scaling, including with large reward gaps, but these good regret properties come at the cost of highly unstable posterior beliefs.
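To make the scaling regime concrete, the following is a minimal simulation sketch, not the construction used in the paper. It assumes Gaussian rewards with unit variance, arm means equal to `deltas / sqrt(n)`, and a Gaussian prior on each arm mean with variance `prior_var_scaled / n`, so that the prior lives on the same $1/\sqrt{n}$ scale as the gaps. The function name `simulate_thompson`, its parameters, and the choice of rescaled outputs are all illustrative assumptions.

```python
import numpy as np


def simulate_thompson(n, deltas, prior_var_scaled=1.0, seed=0):
    """Run Gaussian Thompson sampling for n steps with arm means deltas / sqrt(n).

    Assumptions (not taken from the source): unit-variance Gaussian rewards and
    an independent N(0, prior_var_scaled / n) prior on each arm mean.
    """
    rng = np.random.default_rng(seed)
    deltas = np.asarray(deltas, dtype=float)
    means = deltas / np.sqrt(n)              # weak-signal scaling of the arm means
    num_arms = len(deltas)
    counts = np.zeros(num_arms)              # number of pulls per arm
    sums = np.zeros(num_arms)                # cumulative reward per arm
    prior_precision = n / prior_var_scaled   # inverse of the prior variance
    regret = 0.0

    for _ in range(n):
        # Conjugate Gaussian posterior for each arm mean.
        precision = prior_precision + counts
        post_mean = sums / precision
        post_sd = 1.0 / np.sqrt(precision)
        # Thompson sampling: draw one mean per arm, play the largest draw.
        draws = rng.normal(post_mean, post_sd)
        arm = int(np.argmax(draws))
        reward = rng.normal(means[arm], 1.0)
        counts[arm] += 1
        sums[arm] += reward
        regret += means.max() - means[arm]

    return {
        "pull_fraction": counts / n,              # fraction of steps spent on each arm
        "scaled_reward_sum": sums / np.sqrt(n),   # cumulative rewards on the sqrt(n) scale
        "scaled_regret": regret / np.sqrt(n),     # regret on the sqrt(n) scale
    }


if __name__ == "__main__":
    for n in (100, 10_000):
        print(n, simulate_thompson(n, deltas=[0.0, 2.0], seed=1))
```

Running the sketch with the same `deltas` at several values of `n` gives a rough sense of how the rescaled quantities behave as the horizon grows while the rescaled gaps stay fixed. A larger `prior_var_scaled` corresponds to a less informative prior.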