Participants in online experiments often enroll over time, which can compromise sample representativeness due to temporal shifts in covariates. This issue is particularly critical in A/B tests (online controlled experiments widely used to evaluate product updates), since these tests are cost-sensitive and typically short in duration. We propose a novel framework that dynamically assesses sample representativeness by dividing the ongoing sampling process into three stages. We then develop stage-specific estimators for the Population Average Treatment Effect (PATE), ensuring that experimental results remain generalizable across varying experiment durations. Leveraging survival analysis, we develop a heuristic function that identifies these stages without requiring prior knowledge of population or sample characteristics, thereby keeping implementation costs low. Our approach bridges the gap between experimental findings and real-world applicability, enabling product decisions to be based on evidence that accurately represents the broader target population. We validate the effectiveness of our framework on three levels: (1) a real-world online experiment conducted on WeChat; (2) a synthetic experiment; and (3) a platform-wide application to 600 A/B tests on WeChat. Additionally, we provide practical guidelines for practitioners to implement our method in real-world settings.