We use the lens of weak signal asymptotics to study a class of sequentially randomized experiments, including those that arise in solving multi-armed bandit problems. In an experiment with $n$ time steps, we let the mean reward gaps between actions scale as $1/\sqrt{n}$, so that the difficulty of the learning task is preserved as $n$ grows. In this regime, we show that the sample paths of a class of sequentially randomized experiments (those adapted to this scaling and whose arm selection probabilities vary continuously with the state) converge weakly to a diffusion limit, given as the solution to a stochastic differential equation. The diffusion limit enables us to derive a refined, instance-specific characterization of the stochastic dynamics, and to obtain several insights into the regret and belief evolution of a number of sequential experiments, including Thompson sampling (but not UCB, which does not satisfy our continuity assumption). We show that all sequential experiments whose randomization probabilities have a Lipschitz-continuous dependence on the observed data suffer sub-optimal regret when the reward gaps are relatively large. Conversely, we find that a version of Thompson sampling with an asymptotically uninformative prior variance achieves near-optimal instance-specific regret scaling, including with large reward gaps, but these good regret properties come at the cost of highly unstable posterior beliefs.
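To make the scaling regime concrete, the following is a minimal simulation sketch, not the construction used in the paper. It assumes Gaussian rewards with known noise level, a zero-mean Gaussian prior whose variance is `nu2 / n` on the raw arm means, and arm means of the form `deltas[k] / sqrt(n)`. The function name `thompson_weak_signal` and the parameters `deltas`, `nu2`, and `noise_sd` are hypothetical names introduced only for illustration.

```python
import math
import random


def thompson_weak_signal(n, deltas, nu2=1.0, noise_sd=1.0, seed=0):
    """Gaussian Thompson sampling for n steps with arm means deltas[k] / sqrt(n).

    Assumptions (illustrative, not taken from the source): known noise_sd and
    a N(0, nu2 / n) prior on each raw arm mean.
    Returns (regret / sqrt(n), pull counts / n).
    """
    rng = random.Random(seed)
    means = [d / math.sqrt(n) for d in deltas]  # weak-signal scaling of the gaps
    best = max(means)
    prior_var = nu2 / n
    counts = [0] * len(means)
    sums = [0.0] * len(means)
    regret = 0.0
    for _ in range(n):
        draws = []
        for k in range(len(means)):
            # Conjugate Gaussian update: precision and mean of the posterior.
            precision = 1.0 / prior_var + counts[k] / noise_sd**2
            post_mean = (sums[k] / noise_sd**2) / precision
            draws.append(rng.gauss(post_mean, math.sqrt(1.0 / precision)))
        # Pull the arm whose posterior draw is largest.
        arm = max(range(len(means)), key=lambda k: draws[k])
        reward = rng.gauss(means[arm], noise_sd)
        counts[arm] += 1
        sums[arm] += reward
        regret += best - means[arm]
    return regret / math.sqrt(n), [c / n for c in counts]


if __name__ == "__main__":
    for n in (100, 1000, 10000):
        scaled_regret, fractions = thompson_weak_signal(n, [0.0, 2.0], nu2=1.0)
        print(n, round(scaled_regret, 3), [round(f, 3) for f in fractions])
```

Under this normalization the cumulative regret is divided by $\sqrt{n}$ and the pull counts by $n$, so the outputs stay on a comparable scale across values of $n$. Increasing `nu2` is one rough way to mimic a less informative prior in this sketch.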