We introduce a new framework for active experimentation, the Prediction-Guided Active Experiment (PGAE), which uses predictions from an existing machine learning model to guide sampling and experimentation. At each time step, an experimental unit is sampled according to a designated sampling distribution. With a given experimental probability, the actual outcome is observed; otherwise, only a prediction of the outcome is available. We begin with the non-adaptive case, in which the joint distribution of the predictor and the actual outcome is assumed to be fully known. For this case, we derive an optimal experimentation strategy by minimizing the semi-parametric efficiency bound for the class of regular estimators, and we introduce an estimator that attains this bound and is therefore asymptotically optimal. We then turn to the adaptive case, in which the predictor is continuously updated with newly sampled data. We show that, under certain regularity assumptions, the adaptive version of the estimator remains efficient and attains the same semi-parametric bound. Finally, we evaluate PGAE through simulations and a semi-synthetic experiment using data from the US Census Bureau. The results indicate that PGAE is effective and compares favorably with existing methods.
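The sampling protocol described above can be illustrated with a minimal sketch. It is not the PGAE estimator or the optimal design from the paper: it uses a standard prediction-plus-inverse-probability-weighted-residual estimate of the mean outcome, and it assumes the sampling distribution coincides with the target population. The names `predictor`, `experiment_prob`, and `run_sketch`, the uniform covariate, and the linear outcome model are all hypothetical choices made for illustration.

```python
import random


def predictor(x):
    # Hypothetical pre-trained model: a deliberately biased guess of E[Y | X = x].
    return 2.0 * x + 0.5


def experiment_prob(x):
    # Hypothetical experimental probability, kept away from zero so that
    # the inverse-probability weights stay bounded.
    return 0.2 + 0.6 * x


def run_sketch(n_steps, seed=0):
    rng = random.Random(seed)
    total = 0.0
    n_labeled = 0
    for _ in range(n_steps):
        # Sample a unit (assumed sampling distribution: Uniform(0, 1)).
        x = rng.random()
        f = predictor(x)
        p = experiment_prob(x)
        term = f  # the prediction is always available
        if rng.random() < p:
            # Experiment is run: observe the actual outcome
            # (assumed data-generating process with E[Y] = 2.0).
            y = 2.0 * x + 1.0 + rng.gauss(0.0, 1.0)
            # Correct the prediction with the weighted residual.
            term += (y - f) / p
            n_labeled += 1
        total += term
    return total / n_steps, n_labeled


estimate, n_labeled = run_sketch(20000)
print(f"estimated mean outcome: {estimate:.3f} using {n_labeled} experiments")
```

Because the residual is weighted by the inverse of the experimental probability, the estimate stays unbiased for the mean outcome even though the hypothetical predictor is biased; a better predictor shrinks the residuals and hence the variance, which is the intuition behind letting predictions guide where experiments are run.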