In the past two decades, A/B testing has proliferated as a way to optimise products in digital domains. Traditional A/B tests use fixed-horizon testing: the sample size is determined in advance, and the experiment runs until that sample size is reached. However, because modern data infrastructure provides continuous feedback, experimenters may make incorrect decisions based on preliminary results of the test. For this reason, anytime-valid inference (AVI) is seeing increased adoption as the modern experimenter's method for rapid decision making on streaming data. This work focuses on safe testing, a novel framework for experimentation that enables continuous analysis without elevating the risk of incorrect conclusions. Safe testing equivalents exist for many common statistical tests, including the z-test, the t-test, and the proportion test. We compare the efficacy of safe tests against classical tests and against another AVI method, the mixture sequential probability ratio test (mSPRT). Comparisons are conducted first on simulated data and then on real-world data from Vinted, a large European online marketplace for second-hand clothing. Our findings indicate that safe tests require fewer samples to detect significant effects, which supports their potential for broader adoption.