Typically, multi-armed bandit (MAB) experiments are analyzed at the end of the study and thus require the analyst to specify a fixed sample size in advance. However, in many online learning applications, it is advantageous to continuously produce inference on the average treatment effect (ATE) between arms as new data arrive and to determine a data-driven stopping time for the experiment. Existing work on continuous inference for adaptive experiments assumes that the treatment assignment probabilities are bounded away from zero and one, thus excluding nearly all standard bandit algorithms. In this work, we develop the Mixture Adaptive Design (MAD), a new experimental design for multi-armed bandits that enables continuous inference on the ATE with guarantees on statistical validity and power for nearly any bandit algorithm. At a high level, the MAD "mixes" a bandit algorithm of the user's choice with a Bernoulli design through a tuning parameter $\delta_t$, where $\delta_t$ is a deterministic sequence that controls the priority placed on the Bernoulli design as the sample size grows. We show that for $\delta_t = o\left(1/t^{1/4}\right)$, the MAD produces a confidence sequence that is asymptotically valid and guaranteed to shrink around the true ATE. We empirically show that the MAD improves the coverage and power of ATE inference in MAB experiments without significant losses in finite-sample reward.
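The mixing step described above can be illustrated with a minimal Python sketch for the two-arm case. It rests on assumptions not stated in the abstract: the mixture is taken to be the convex combination $\delta_t \cdot \tfrac{1}{2} + (1 - \delta_t)\, p_t$, where $p_t$ is the probability the bandit algorithm would assign to the treatment arm, and the Bernoulli design is taken to be a fair coin. The names `mad_treatment_probability` and `mad_assign` are hypothetical, and $\delta_t$ is passed in by the caller rather than generated from any particular schedule.

```python
import random


def mad_treatment_probability(bandit_prob: float, delta_t: float) -> float:
    """Blend the bandit's treatment probability with a fair Bernoulli design.

    Assumed mixing rule (two arms, not taken from the abstract):
        delta_t * 1/2 + (1 - delta_t) * bandit_prob
    """
    if not 0.0 <= bandit_prob <= 1.0:
        raise ValueError("bandit_prob must lie in [0, 1]")
    if not 0.0 <= delta_t <= 1.0:
        raise ValueError("delta_t must lie in [0, 1]")
    return delta_t * 0.5 + (1.0 - delta_t) * bandit_prob


def mad_assign(bandit_prob: float, delta_t: float, rng: random.Random):
    """Draw one arm (1 = treatment, 0 = control) under the mixed design.

    Returns the arm together with the probability that was used for the
    treatment arm, so that a later analysis step can record it.
    """
    p = mad_treatment_probability(bandit_prob, delta_t)
    arm = 1 if rng.random() < p else 0
    return arm, p


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the draw is reproducible
    # A bandit that has collapsed onto the treatment arm (probability 1).
    arm, p = mad_assign(bandit_prob=1.0, delta_t=0.2, rng=rng)
    print(arm, p)
```

Under this assumed rule, the mixed probability always lies in $[\delta_t/2,\, 1 - \delta_t/2]$, even when the bandit algorithm itself places probability zero or one on an arm. That is one way to see why the choice of $\delta_t$ governs how quickly the assignment probabilities are allowed to approach the boundary.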