Estimating the effects of long-term treatments in A/B testing presents a significant challenge. Such treatments -- including updates to product functions, user interface designs, and recommendation algorithms -- are intended to remain in the system for a long period after launch. However, given the practical constraints on running long-term experiments, practitioners often rely on short-term experimental results to make product launch decisions. How to accurately estimate the effects of long-term treatments from short-term experimental data remains an open question. To address this question, we introduce a longitudinal surrogate framework. We show that, under standard assumptions, the effects of long-term treatments can be decomposed into a series of functions that depend on user attributes, short-term intermediate metrics, and treatment assignments. We describe the identification assumptions, estimation strategies, and inference technique under this framework. Empirically, using two real-world experiments, each involving millions of users on WeChat, one of the world's largest social networking platforms, we show that our approach outperforms existing solutions.