An Evaluation on Practical Batch Bayesian Sampling Algorithms for Online Adaptive Traffic Experimentation

To speed up online testing, adaptive traffic experimentation based on multi-armed bandit algorithms is emerging as an essential complement to fixed-horizon A/B testing. Building on recent research on best arm identification and statistical inference with adaptively collected data, this paper derives and evaluates four Bayesian batch bandit algorithms (NB-TS, WB-TS, NB-TTTS, WB-TTTS). They combine two ways of weighting batches (Naive Batch and Weighted Batch) with two Bayesian sampling strategies (Thompson Sampling and Top-Two Thompson Sampling) to determine traffic allocation adaptively. For practicality, the derived algorithms rely only on summary batch statistics of a reward metric, which suits pilot experiments; one combination, WB-TTTS, appears to be newly discussed in this paper. The evaluation covers the trustworthiness, sensitivity, and regret of a testing methodology, and it includes 4 real-world eBay experiments and 40 reproducible synthetic experiments that cover both stationary and non-stationary situations. The evaluation reveals that: (a) false positives are inflated when there are equivalent best arms, an issue seldom discussed in the literature; (b) to control false positives, connections between the convergence of posterior optimal probabilities and neutral posterior reshaping are discovered; (c) WB-TTTS shows competitive recall, higher precision, and robustness against non-stationary trends; (d) NB-TS outperforms the others at minimizing regret trials, but not on precision or robustness; (e) WB-TTTS is a promising alternative if the regret of A/B testing is affordable; otherwise NB-TS remains a powerful choice for pilot experiments when regret is a consideration.
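
As a rough illustration of how the two sampling strategies differ, the sketch below turns per-batch summary statistics into a traffic allocation. It is not the paper's derivation: the function names `pooled_posterior` and `allocation`, the Gaussian posterior with a flat prior, the sample-size pooling of batches, and the default `beta=0.5` are all assumptions made for this example. The Weighted Batch scheme is not reproduced, since its definition is not given here.

```python
import numpy as np


def pooled_posterior(batch_means, batch_vars, batch_counts):
    """Approximate Gaussian posterior (mean, std) per arm.

    Inputs have shape (n_batches, n_arms). Batches are pooled by sample
    size under a flat prior (an assumption of this sketch).
    """
    means = np.asarray(batch_means, dtype=float)
    variances = np.asarray(batch_vars, dtype=float)
    counts = np.asarray(batch_counts, dtype=float)
    n = counts.sum(axis=0)
    post_mean = (counts * means).sum(axis=0) / n
    # Sample-size-weighted within-batch variance, then standard error of the mean.
    pooled_var = (counts * variances).sum(axis=0) / n
    post_std = np.sqrt(pooled_var / n)
    return post_mean, post_std


def allocation(post_mean, post_std, strategy="ts", beta=0.5,
               n_draws=20000, seed=0):
    """Traffic share per arm for the next batch.

    "ts":   share = Monte Carlo estimate of P(arm is best).
    "ttts": with probability beta play the sampled leader; otherwise
            play a challenger, i.e. the best arm of a fresh posterior
            draw conditioned on differing from the leader.
    """
    rng = np.random.default_rng(seed)
    k = len(post_mean)
    draws = rng.normal(post_mean, post_std, size=(n_draws, k))
    p_best = np.bincount(draws.argmax(axis=1), minlength=k) / n_draws
    if strategy == "ts":
        return p_best

    shares = beta * p_best
    for leader in range(k):
        if p_best[leader] == 0:
            continue
        rest = 1.0 - p_best[leader]
        if rest <= 0:
            # No challenger was ever sampled; fall back to the leader.
            shares[leader] += (1 - beta) * p_best[leader]
            continue
        for challenger in range(k):
            if challenger != leader:
                shares[challenger] += (
                    (1 - beta) * p_best[leader] * p_best[challenger] / rest
                )
    return shares


# Two batches, three arms, made-up conversion-like numbers.
means = [[0.10, 0.12, 0.10], [0.10, 0.13, 0.11]]
variances = [[0.09, 0.09, 0.09], [0.09, 0.09, 0.09]]
counts = [[1000, 1000, 1000], [1000, 1000, 1000]]

mu, sd = pooled_posterior(means, variances, counts)
ts_shares = allocation(mu, sd, strategy="ts")
ttts_shares = allocation(mu, sd, strategy="ttts")
print("TS  :", ts_shares)
print("TTTS:", ttts_shares)
```

In this toy setup, Thompson Sampling sends most traffic to the apparent leader, while Top-Two Thompson Sampling caps the leader's share near `beta` and spends the rest on challengers. That is in line with the trade-off between regret and identification quality that the evaluation reports.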
