Using Auxiliary Data to Boost Precision in the Analysis of A/B Tests on an Online Educational Platform: New Data and New Results

Randomized A/B tests within online learning platforms represent an exciting direction in learning sciences. With minimal assumptions, they allow causal effect estimation without confounding bias and exact statistical inference even in small samples. However, experimental samples and/or treatment effects are often small, leaving A/B tests underpowered and effect estimates imprecise. Recent methodological advances have shown that power and statistical precision can be substantially boosted by coupling design-based causal estimation to machine-learning models of rich log data from historical users who were not in the experiment (the "remnant"). Estimates using these techniques remain unbiased and inference remains exact without any additional assumptions. This paper reviews those methods and applies them to a new dataset including over 250 randomized A/B comparisons conducted within ASSISTments, an online learning platform. We compare results across experiments using four novel deep-learning models of auxiliary data and show that incorporating auxiliary data into causal estimates is roughly equivalent to increasing the sample size by 20% on average, or as much as 50-80% in some cases, relative to t-tests, and by about 10% on average, or as much as 30-50%, relative to cutting-edge unbiased machine-learning estimators that use only data from the experiments. We show that the gains can be even larger when estimating subgroup effects, that they hold even when the remnant is unrepresentative of the A/B test sample, and that they extend to post-stratification estimators of population effects.
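
To make the core idea concrete, the following is a minimal simulation sketch, not the estimators evaluated in this paper. It assumes a much simpler stand-in: a least-squares model is fit only on simulated historical data, its predictions are subtracted from the experimental outcomes, and a difference in means is taken on the residuals. Because the predictions do not depend on treatment assignment, the adjusted estimate stays unbiased under randomization. All function names, the data-generating process, and the parameter values are hypothetical and chosen for illustration only.

```python
import numpy as np

def fit_remnant_model(x_rem, y_rem):
    """Least-squares fit on historical data only (stand-in for a deep-learning model)."""
    design = np.column_stack([np.ones(len(x_rem)), x_rem])
    coef, *_ = np.linalg.lstsq(design, y_rem, rcond=None)
    return coef

def predict(coef, x):
    """Predict outcomes for new covariates using the fitted coefficients."""
    return np.column_stack([np.ones(len(x)), x]) @ coef

def diff_in_means(y, z):
    """Plain difference in mean outcomes between treatment (z=1) and control (z=0)."""
    return y[z == 1].mean() - y[z == 0].mean()

def remnant_adjusted(y, z, y_hat):
    """Difference in means of residuals y - y_hat; y_hat never sees the assignment z."""
    return diff_in_means(y - y_hat, z)

def simulate(n_rem=5000, n_exp=200, true_effect=0.2, reps=2000, seed=0):
    rng = np.random.default_rng(seed)
    beta = np.array([1.0, -0.5, 0.8])  # hypothetical covariate effects

    # Historical users who were never in the experiment.
    x_rem = rng.normal(size=(n_rem, 3))
    y_rem = x_rem @ beta + rng.normal(size=n_rem)
    coef = fit_remnant_model(x_rem, y_rem)

    # Experimental sample with fixed potential outcomes and a constant effect.
    x_exp = rng.normal(size=(n_exp, 3))
    y0 = x_exp @ beta + rng.normal(size=n_exp)
    y1 = y0 + true_effect
    y_hat = predict(coef, x_exp)

    plain, adjusted = [], []
    for _ in range(reps):
        # Re-randomize: half of the sample to treatment, half to control.
        z = rng.permutation(np.repeat([0, 1], n_exp // 2))
        y = np.where(z == 1, y1, y0)
        plain.append(diff_in_means(y, z))
        adjusted.append(remnant_adjusted(y, z, y_hat))

    plain, adjusted = np.array(plain), np.array(adjusted)
    return {
        "plain_mean": plain.mean(),
        "adjusted_mean": adjusted.mean(),
        "variance_ratio": plain.var() / adjusted.var(),
    }

if __name__ == "__main__":
    print(simulate())
```

The `variance_ratio` can be read as an effective sample-size multiplier: a ratio of 1.2 corresponds to roughly a 20% larger sample for the unadjusted estimator. In this idealized simulation the auxiliary model is correctly specified, so the ratio will typically be far larger than the gains reported for real experiments.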
