This paper offers a detailed investigation of switchback designs in A/B testing, which alternate between baseline and new policies over time. Our aim is to thoroughly evaluate the effects of these designs on the accuracy of the resulting average treatment effect (ATE) estimators. We propose a novel "weak signal analysis" framework, which substantially simplifies the calculation of the mean squared errors (MSEs) of these ATE estimators in Markov decision process environments. Our findings suggest that (i) when the majority of reward errors are positively correlated, the switchback design is more efficient than the alternating-day design, which switches policies on a daily basis; in this case, increasing the frequency of policy switches tends to reduce the MSE of the ATE estimator. (ii) When the errors are uncorrelated, however, all these designs become asymptotically equivalent. (iii) When the majority of errors are negatively correlated, the alternating-day design becomes the optimal choice. These insights offer practitioners guidelines for designing experiments in A/B testing. Our analysis accommodates a variety of policy value estimators, including model-based estimators, least squares temporal difference learning estimators, and double reinforcement learning estimators, thereby offering a comprehensive understanding of optimal design strategies for policy evaluation in reinforcement learning.
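The direction of findings (i)-(iii) can be illustrated with a toy calculation. The sketch below is not the weak signal analysis itself: it assumes a single sequence of rewards whose errors follow a stationary AR(1) process with unit variance, ignores carryover effects and state dynamics entirely, and uses a simple difference-in-means estimator. The names `ar1_cov`, `design`, and `noise_variance` are hypothetical helpers introduced only for this illustration.

```python
import numpy as np


def ar1_cov(n, rho):
    """Covariance matrix of n unit-variance AR(1) errors: Sigma[i, j] = rho^|i-j|."""
    idx = np.arange(n)
    return rho ** np.abs(idx[:, None] - idx[None, :])


def design(n, block):
    """Assignment vector: +1 (new policy) or -1 (baseline), switching every `block` steps."""
    return np.where((np.arange(n) // block) % 2 == 0, 1.0, -1.0)


def noise_variance(n, block, rho):
    """Variance of the error term of the difference-in-means ATE estimator.

    With equally sized arms the estimator is (2/n) * sum_t a_t * R_t, so its
    error variance is (4/n^2) * a' Sigma a. Toy assumption: no carryover.
    """
    a = design(n, block)
    return 4.0 / n**2 * (a @ ar1_cov(n, rho) @ a)


if __name__ == "__main__":
    n = 48  # time steps in the experiment (assumed for illustration)
    for rho in (0.5, 0.0, -0.5):
        for block in (1, 4, 24):  # block = 24 mimics one switch per "day"
            print(f"rho={rho:+.1f} block={block:2d} var={noise_variance(n, block, rho):.4f}")
```

Under these assumptions, shorter blocks give a smaller error variance when `rho` is positive, all block lengths coincide when `rho` is zero, and longer blocks are preferable when `rho` is negative, which matches the qualitative ordering stated above.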