In the design stage of a randomized experiment, one way to ensure that treatment and control groups exhibit similar covariate distributions is to randomize treatment until some prespecified level of covariate balance is satisfied. This experimental design strategy is known as rerandomization. Most rerandomization methods use balance metrics based on a quadratic form $v^TAv$, where $v$ is a vector of covariate mean differences and $A$ is a positive semi-definite matrix. In this work, we derive general results for treatment-versus-control rerandomization schemes that employ quadratic forms for covariate balance. In addition to allowing researchers to quickly derive properties of rerandomization schemes not previously considered, our theoretical results provide guidance on how to choose the matrix $A$ in practice. We find that the Mahalanobis and Euclidean distances optimize different measures of covariate balance. Furthermore, we establish how the covariates' eigenstructure and their relationship to the outcomes dictate which matrix $A$ yields the most precise mean-difference estimator for the average treatment effect. We find that the Euclidean distance is minimax optimal, in the sense that the mean-difference estimator's precision is never too far from that achieved by the optimal choice of $A$, regardless of the relationship between covariates and outcomes. Our theoretical results are verified via simulation, where we find that rerandomization using the Euclidean distance performs better in high-dimensional settings and typically achieves greater variance reduction for the mean-difference estimator than other quadratic forms.
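The rerandomization procedure described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the helper names (`mean_diff`, `acceptance_threshold`, `rerandomize`), the simulated covariates, the 10% acceptance probability, and the use of a pilot-draw quantile to set the threshold are all assumptions made for illustration. The Mahalanobis choice of $A$ is taken here to be the inverse covariance of $v$ under complete randomization, and the Euclidean choice is the identity matrix.

```python
import numpy as np

def mean_diff(X, w):
    """Vector v of covariate mean differences (treated minus control)."""
    return X[w == 1].mean(axis=0) - X[w == 0].mean(axis=0)

def draw_assignment(n, n_treated, rng):
    """One complete randomization: exactly n_treated units get treatment."""
    w = np.zeros(n, dtype=int)
    w[rng.choice(n, size=n_treated, replace=False)] = 1
    return w

def acceptance_threshold(X, A, n_treated, rng, accept_prob=0.1, n_pilot=2000):
    """Hypothetical helper: set the threshold to the empirical accept_prob
    quantile of v^T A v over pilot randomizations."""
    vals = []
    for _ in range(n_pilot):
        v = mean_diff(X, draw_assignment(X.shape[0], n_treated, rng))
        vals.append(v @ A @ v)
    return float(np.quantile(vals, accept_prob))

def rerandomize(X, A, threshold, n_treated, rng, max_draws=10_000):
    """Redraw the assignment until the quadratic form v^T A v <= threshold."""
    for _ in range(max_draws):
        w = draw_assignment(X.shape[0], n_treated, rng)
        v = mean_diff(X, w)
        if v @ A @ v <= threshold:
            return w, v
    raise RuntimeError("no acceptable assignment found; loosen the threshold")

rng = np.random.default_rng(0)
n, k = 100, 5
n_treated = n // 2
# Simulated covariates with unequal scales (illustrative data only).
X = rng.normal(size=(n, k)) * np.array([3.0, 2.0, 1.0, 0.5, 0.25])

# Covariance of v under complete randomization: (1/n1 + 1/n0) * S.
S = np.cov(X, rowvar=False)
Sigma_v = S * (1.0 / n_treated + 1.0 / (n - n_treated))

choices = {
    "Mahalanobis": np.linalg.inv(Sigma_v),  # A = Cov(v)^{-1}
    "Euclidean": np.eye(k),                 # A = I
}

results = {}
for name, A in choices.items():
    a = acceptance_threshold(X, A, n_treated, rng)
    w, v = rerandomize(X, A, a, n_treated, rng)
    results[name] = (w, v, a)
    print(name, "accepted with v^T A v =", float(v @ A @ v), "threshold =", a)
```

Swapping in a different positive semi-definite `A` only requires adding an entry to `choices`; the acceptance rule itself is unchanged.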