To speed up online testing, adaptive traffic experimentation with multi-armed bandit algorithms is emerging as an essential complement to fixed-horizon A/B testing. Building on recent research on best arm identification and on statistical inference with adaptively collected data, this paper derives and evaluates four Bayesian batch bandit algorithms (NB-TS, WB-TS, NB-TTTS, WB-TTTS). Each combines one of two ways of weighting batches (Naive Batch and Weighted Batch) with one of two Bayesian sampling strategies (Thompson Sampling and Top-Two Thompson Sampling) to determine traffic allocation adaptively. For practicality, the derived algorithms are based on summary batch statistics of a reward metric for pilot experiments; one of the combinations, WB-TTTS, appears to be newly discussed in this paper. The evaluation of the four algorithms covers the trustworthiness, sensitivity, and regret of a testing methodology. It includes 4 real-world eBay experiments and 40 reproducible synthetic experiments, covering both stationary and non-stationary situations. Our evaluation reveals that: (a) false positives are inflated when there are equivalent best arms, an issue seldom discussed in the literature; (b) to control false positives, there is a connection between the convergence of posterior optimal probabilities and neutral posterior reshaping; (c) WB-TTTS shows competitive recall, higher precision, and robustness against non-stationary trends; (d) NB-TS performs best at minimizing regret trials but falls short on precision and robustness; (e) WB-TTTS is a promising alternative if the regret of A/B testing is affordable, whereas NB-TS remains a powerful choice for pilot experiments when regret is a concern.
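
To make the two sampling strategies named in the abstract concrete, the following is a minimal sketch of how Thompson Sampling (TS) and Top-Two Thompson Sampling (TTTS) could turn per-arm batch summary statistics into a traffic allocation. It is not the paper's derivation. It assumes independent normal posteriors built from a mean and a standard error per arm, a tuning parameter `beta = 0.5`, and hypothetical function names. It does not implement the Naive Batch or Weighted Batch weighting, which the abstract does not define.

```python
import numpy as np


def optimal_probabilities(means, std_errs, n_draws=20000, seed=0):
    """Monte Carlo estimate of P(arm k has the highest mean).

    Assumption: each arm's posterior is an independent normal with the
    given mean and standard error (summary statistics only).
    """
    rng = np.random.default_rng(seed)
    means = np.asarray(means, dtype=float)
    std_errs = np.asarray(std_errs, dtype=float)
    # One row per posterior draw, one column per arm.
    draws = rng.normal(means, std_errs, size=(n_draws, means.size))
    winners = draws.argmax(axis=1)
    return np.bincount(winners, minlength=means.size) / n_draws


def ts_allocation(means, std_errs, **kwargs):
    """TS: allocate traffic in proportion to the posterior optimal probabilities."""
    return optimal_probabilities(means, std_errs, **kwargs)


def ttts_allocation(means, std_errs, beta=0.5, eps=1e-9, **kwargs):
    """TTTS: play the sampled best arm with probability beta, otherwise
    resample until a different arm comes out on top and play that one.

    With optimal probabilities p, the resulting allocation is
        psi_j = p_j * (beta + (1 - beta) * sum_{i != j} p_i / (1 - p_i)).
    """
    p = optimal_probabilities(means, std_errs, **kwargs)
    # Keep probabilities strictly inside (0, 1) so 1 - p_i is never zero.
    p = np.clip(p, eps, None)
    p = p / p.sum()
    ratio = p / (1.0 - p)
    # others[j] = sum of ratio[i] over all i != j.
    pairwise = np.tile(ratio, (p.size, 1))
    np.fill_diagonal(pairwise, 0.0)
    others = pairwise.sum(axis=1)
    return p * (beta + (1.0 - beta) * others)


if __name__ == "__main__":
    # Hypothetical batch summaries for three arms (illustrative numbers only).
    means = [0.10, 0.13, 0.11]
    std_errs = [0.01, 0.01, 0.01]
    print("TS allocation:  ", ts_allocation(means, std_errs))
    print("TTTS allocation:", ttts_allocation(means, std_errs))
```

In this sketch TS concentrates traffic on the apparent best arm, while TTTS reserves a `1 - beta` share for challengers. That is one intuition for the trade-off the abstract reports between regret and the quality of best-arm identification.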