Precise Unbiased Estimation in Randomized Experiments using Auxiliary Observational Data

Randomized controlled trials (RCTs) are increasingly prevalent in education research, and are often regarded as a gold standard of causal inference. Two main virtues of randomized experiments are that they (1) do not suffer from confounding, thereby allowing for an unbiased estimate of an intervention's causal impact, and (2) allow for design-based inference, meaning that the physical act of randomization largely justifies the statistical assumptions made. However, RCT sample sizes are often small, leading to low precision; in many cases RCT estimates may be too imprecise to guide policy or inform science. Observational studies, by contrast, have strengths and weaknesses complementary to those of RCTs: they typically offer much larger sample sizes, but may suffer from confounding. In many contexts, experimental and observational data exist side by side, allowing the possibility of integrating "big observational data" with "small but high-quality experimental data" to get the best of both. Such approaches hold particular promise in the field of education, where RCT sample sizes are often small due to cost constraints, but where automatic collection of observational data, such as in computerized educational technology applications or in state longitudinal data systems (SLDS) with administrative data on hundreds of thousands of students, has made rich, high-dimensional observational data widely available. We outline an approach that allows one to employ machine learning algorithms to learn from the observational data, and to use the resulting models to improve precision in randomized experiments. Importantly, there is no requirement that the machine learning models be "correct" in any sense, and the final experimental results are guaranteed to be exactly unbiased. Thus, there is no danger of confounding biases in the observational data leaking into the experiment.
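
To make the idea concrete, the following is a minimal simulation sketch, not the estimator developed in this paper. It assumes a completely randomized experiment with a constant treatment effect, and it uses ordinary least squares as a stand-in for the machine learning model. All function names, sample sizes, and coefficients are hypothetical choices for illustration. A model is fit once on auxiliary observational data whose outcomes are deliberately shifted, so the model is "wrong" for the experimental sample. Its predictions are then subtracted from the experimental outcomes before taking a difference in means. Because the predictions are fixed before treatment is assigned, the adjusted estimate remains unbiased under randomization, while its variance shrinks whenever the predictions track the outcomes.

```python
import numpy as np

def fit_linear(X, y):
    """Fit least squares on auxiliary data and return a prediction function."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return lambda X_new: np.column_stack([np.ones(len(X_new)), X_new]) @ coef

def diff_in_means(y, z):
    """Treated-minus-control difference in mean outcomes."""
    return y[z == 1].mean() - y[z == 0].mean()

def simulate(n_obs=5000, n_rct=60, tau=1.0, reps=2000, seed=0):
    rng = np.random.default_rng(seed)
    beta = np.array([2.0, -1.0, 0.5])  # hypothetical outcome coefficients

    # Auxiliary observational data. The +0.7 shift mimics a systematic
    # difference from the RCT population, so the fitted model is biased.
    X_obs = rng.normal(size=(n_obs, 3))
    y_obs = X_obs @ beta + 0.7 + rng.normal(size=n_obs)
    predict = fit_linear(X_obs, y_obs)

    # Fixed RCT sample with fixed potential outcomes (constant effect tau).
    X = rng.normal(size=(n_rct, 3))
    y0 = X @ beta + rng.normal(size=n_rct)
    y1 = y0 + tau
    pred = predict(X)  # computed before, and independent of, assignment

    plain, adjusted = [], []
    for _ in range(reps):
        # Complete randomization: exactly half the sample is treated.
        z = np.zeros(n_rct, dtype=int)
        z[rng.choice(n_rct, n_rct // 2, replace=False)] = 1
        y = np.where(z == 1, y1, y0)
        plain.append(diff_in_means(y, z))
        adjusted.append(diff_in_means(y - pred, z))  # residualized outcome
    return np.array(plain), np.array(adjusted)

if __name__ == "__main__":
    plain, adjusted = simulate()
    print("plain:    mean", plain.mean(), "sd", plain.std())
    print("adjusted: mean", adjusted.mean(), "sd", adjusted.std())
```

Under these assumptions, both estimators average close to the true effect across re-randomizations, and the adjusted one varies less. The intercept shift in the auxiliary data cancels in the difference in means, which illustrates why a misspecified model costs precision at worst, not unbiasedness.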
