Statistical disparity between treatment groups is one of the most significant challenges in estimating Conditional Average Treatment Effects (CATE). To address it, we introduce a model-agnostic data augmentation method that imputes counterfactual outcomes for a selected subset of individuals. Specifically, we use contrastive learning to learn a representation space and a similarity measure such that individuals identified as close by the learned similarity measure in that space have similar potential outcomes. This property enables reliable imputation of counterfactual outcomes for individuals who have close neighbors in the alternative treatment group. By augmenting the original dataset with these reliable imputations, we reduce the discrepancy between treatment groups while introducing minimal imputation error. The augmented dataset is then used to train CATE estimation models. Theoretical analysis and experiments on synthetic and semi-synthetic benchmarks show that our method significantly improves both the performance and the robustness to overfitting of state-of-the-art models.
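
The augmentation step described above can be illustrated with a minimal sketch. It is not the paper's implementation: it assumes the representations `z` have already been learned (the contrastive training is omitted), uses plain Euclidean distance as a stand-in for the learned similarity measure, and introduces two hypothetical parameters, a neighbor count `k` and a distance threshold `eps`, to decide which individuals have close enough neighbors in the alternative treatment group. The function name `augment_with_counterfactuals` is likewise illustrative.

```python
import numpy as np


def augment_with_counterfactuals(z, t, y, k=1, eps=0.5):
    """Impute counterfactual outcomes from close neighbors in the other group.

    z: (n, d) representations (assumed to be already learned)
    t: (n,) binary treatment indicators (0 or 1)
    y: (n,) observed (factual) outcomes
    k, eps: hypothetical selection parameters, not taken from the source
    """
    z = np.asarray(z, dtype=float)
    t = np.asarray(t, dtype=int)
    y = np.asarray(y, dtype=float)

    new_z, new_t, new_y = [], [], []
    for i in range(len(z)):
        # Candidates: individuals who received the alternative treatment.
        other = np.flatnonzero(t != t[i])
        if other.size == 0:
            continue
        # Euclidean distance stands in for the learned similarity measure.
        dist = np.linalg.norm(z[other] - z[i], axis=1)
        within = dist <= eps
        if within.sum() < k:
            # Too few close neighbors: skip to avoid an unreliable imputation.
            continue
        nearest = other[within][np.argsort(dist[within])[:k]]
        new_z.append(z[i])
        new_t.append(1 - t[i])            # flipped treatment
        new_y.append(y[nearest].mean())   # imputed counterfactual outcome

    if not new_z:
        return z, t, y
    return (
        np.vstack([z, np.array(new_z)]),
        np.concatenate([t, np.array(new_t, dtype=int)]),
        np.concatenate([y, np.array(new_y, dtype=float)]),
    )


# Toy data: two individuals are close to each other, the third is isolated.
z = np.array([[0.0], [0.1], [5.0]])
t = np.array([0, 1, 0])
y = np.array([1.0, 2.0, 3.0])
z_aug, t_aug, y_aug = augment_with_counterfactuals(z, t, y, k=1, eps=0.5)
```

In this toy example only the first two individuals receive an imputed counterfactual; the isolated third one is left out. The augmented arrays could then be passed to any CATE estimator, which is what makes the approach model-agnostic.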