Online platforms routinely compare multi-armed bandit algorithms, such as UCB and Thompson Sampling, to select the best-performing policy. Unlike standard A/B tests for static treatments, each run of a bandit algorithm over $T$ users produces only one dependent trajectory, because the algorithm's decisions depend on all past interactions. Reliable inference therefore demands many independent restarts of the algorithm, making experimentation costly and delaying deployment decisions. We propose Artificial Replay (AR) as a new experimental design for this problem. AR first runs one policy and records its trajectory. When the second policy is executed, it reuses a recorded reward whenever it selects an action the first policy already took, and queries the real environment only otherwise. We develop a new analytical framework for this design and prove three key properties of the resulting estimator: it is unbiased; it requires only $T + o(T)$ user interactions instead of the $2T$ needed to run the treatment and control policies separately, nearly halving the experimental cost when both policies have sub-linear regret; and its variance grows sub-linearly in $T$, whereas the estimator from a naïve design has linearly growing variance. Numerical experiments with UCB, Thompson Sampling, and $\epsilon$-greedy policies confirm these theoretical gains.
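The replay step described above can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes a Bernoulli reward environment, assumes each recorded reward is consumed at most once (first in, first out per arm), and takes the difference in cumulative reward between the two policies as the estimated quantity. All names (`BernoulliEnv`, `run_policy`, `artificial_replay`, `ucb1`, `make_eps_greedy`) are hypothetical.

```python
import math
import random
from collections import defaultdict, deque


class BernoulliEnv:
    """Hypothetical environment: arm a pays 1 with probability means[a]."""

    def __init__(self, means, rng):
        self.means = means
        self.rng = rng
        self.queries = 0  # number of real user interactions consumed

    def pull(self, arm):
        self.queries += 1
        return 1.0 if self.rng.random() < self.means[arm] else 0.0


def ucb1(counts, sums, t, rng):
    """UCB1: try each arm once, then maximize mean plus exploration bonus."""
    for arm, n in enumerate(counts):
        if n == 0:
            return arm
    return max(
        range(len(counts)),
        key=lambda a: sums[a] / counts[a]
        + math.sqrt(2.0 * math.log(t + 1) / counts[a]),
    )


def make_eps_greedy(eps):
    """Epsilon-greedy: explore uniformly with probability eps."""

    def choose(counts, sums, t, rng):
        if rng.random() < eps:
            return rng.randrange(len(counts))
        return max(
            range(len(counts)),
            key=lambda a: sums[a] / counts[a] if counts[a] else float("inf"),
        )

    return choose


def run_policy(choose, n_arms, horizon, get_reward, rng):
    """Run one policy for `horizon` steps; return its (arm, reward) trajectory."""
    counts, sums, trajectory = [0] * n_arms, [0.0] * n_arms, []
    for t in range(horizon):
        arm = choose(counts, sums, t, rng)
        reward = get_reward(arm)
        counts[arm] += 1
        sums[arm] += reward
        trajectory.append((arm, reward))
    return trajectory


def artificial_replay(policy_a, policy_b, env, horizon, rng):
    """Run policy_a on the environment, then policy_b with replayed rewards."""
    n_arms = len(env.means)
    traj_a = run_policy(policy_a, n_arms, horizon, env.pull, rng)

    # Assumption: recorded rewards are stored per arm and each is used once.
    recorded = defaultdict(deque)
    for arm, reward in traj_a:
        recorded[arm].append(reward)

    def replay_or_pull(arm):
        if recorded[arm]:
            return recorded[arm].popleft()  # reuse a recorded reward
        return env.pull(arm)  # otherwise query the real environment

    traj_b = run_policy(policy_b, n_arms, horizon, replay_or_pull, rng)
    diff = sum(r for _, r in traj_b) - sum(r for _, r in traj_a)
    return diff, env.queries


if __name__ == "__main__":
    rng = random.Random(0)
    env = BernoulliEnv([0.3, 0.5, 0.7], rng)
    diff, queries = artificial_replay(make_eps_greedy(0.1), ucb1, env, 2000, rng)
    print(f"estimated reward difference: {diff}, real interactions: {queries}")
```

In this sketch the number of real interactions always lies between $T$ and $2T$. When both arguments are the same deterministic policy, the second run replays the first trajectory exactly and no additional interactions are needed.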