Integrating Heterogeneous Information in Randomized Experiments: A Unified Calibration Framework

In modern randomized experiments, large-scale data collection increasingly yields rich baseline covariates and auxiliary information from multiple sources. Such information offers opportunities for more precise treatment effect estimation, but it also raises the challenge of integrating heterogeneous information coherently without compromising validity. Covariate-adaptive randomization (CAR) is widely used to improve covariate balance at the design stage, but it typically balances only a small set of covariates used to form strata, making covariate adjustment at the analysis stage essential for more efficient estimation of treatment effects. Beyond standard covariate adjustment, it is often desirable to incorporate auxiliary information, including cross-stratum information, predictions from various machine learning models, and external data from historical trials or real-world sources. While this auxiliary information is widely available, existing covariate adjustment methods under CAR primarily exploit within-stratum covariates and do not provide a coherent mechanism for integrating it. We propose a unified calibration framework that integrates such information through an information proxy vector and calibration weights defined by a convex optimization problem. The resulting estimator recovers many recent covariate adjustment procedures as special cases while providing a systematic mechanism for both internal and external information borrowing within a single framework. We establish large-sample validity and a no-harm efficiency guarantee, showing that incorporating additional information sources cannot increase asymptotic variance, and we extend the theory to settings in which both the number of strata and the number of information sources grow with the sample size.
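The calibration idea can be made concrete with a small sketch. The abstract only states that the weights solve a convex optimization problem; the quadratic distance used below, the function names `calibration_weights` and `calibrated_ate`, and the synthetic data are all assumptions made for illustration, and the sketch ignores strata entirely. It is one simple instance of calibration weighting, not the estimator proposed here: within each arm, weights stay as close to 1 as possible while forcing the weighted mean of a proxy vector to match its full-sample mean.

```python
import numpy as np

def calibration_weights(proxy, target):
    """Quadratic-distance calibration weights for one treatment arm.

    Solves  min_w 0.5 * sum_i (w_i - 1)^2
            s.t.  sum_i w_i = n  and  sum_i w_i * proxy_i = n * target.
    The Lagrangian conditions give the closed form w = 1 + Z @ lam.
    (The quadratic distance is an illustrative assumption.)
    """
    n = proxy.shape[0]
    Z = np.column_stack([np.ones(n), proxy])   # intercept plus proxy columns
    t = np.concatenate([[1.0], target])        # calibration targets (per-unit scale)
    lam = np.linalg.solve(Z.T @ Z, n * t - Z.sum(axis=0))
    return 1.0 + Z @ lam

def calibrated_ate(y, a, proxy):
    """Difference of calibrated arm means; targets are full-sample proxy means."""
    target = proxy.mean(axis=0)
    means = {}
    for arm in (0, 1):
        idx = a == arm
        w = calibration_weights(proxy[idx], target)
        means[arm] = np.sum(w * y[idx]) / idx.sum()
    return means[1] - means[0]

# Synthetic data, for illustration only (hypothetical outcome model).
rng = np.random.default_rng(0)
n = 400
x = rng.normal(size=(n, 2))                    # stand-in for the information proxy vector
a = rng.integers(0, 2, size=n)                 # simple randomization, no strata
y = 1.0 * a + x @ np.array([2.0, -1.0]) + rng.normal(size=n)

tau_hat = calibrated_ate(y, a, x)
w1 = calibration_weights(x[a == 1], x.mean(axis=0))
```

With this choice of distance the weights have a closed form and the resulting arm means coincide with a linear regression-type adjustment. Other convex distances (for example, an entropy distance) would generally require a numerical solver.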
