Adaptive causal representation learning from observational data is presented, integrated with an efficient sample splitting technique within the semiparametric estimating equation framework. Support points sample splitting (SPSS), a subsampling method based on energy distance, is employed for efficient double machine learning (DML) in causal inference. In contrast to traditional random splitting, the support points are selected and split as optimal representative points of the full raw data in a random sample, providing an optimal sub-representation of the underlying data generating distribution. They offer the best representation of a full big dataset, whereas traditional random data splitting most likely does not preserve the unit structural information of the underlying distribution. Three machine learning estimators are adopted for causal inference using SPSS: support vector machine (SVM), deep learning (DL), and a hybrid super learner (SL) with deep learning (SDL). A comparative study is conducted between the proposed SVM, DL, and SDL representations using SPSS and the benchmark results of Chernozhukov et al. (2018), which employed random forests, neural networks, and regression trees with a random k-fold cross-fitting technique on the 401(k) pension plan real data. The simulations show that DL with SPSS outperforms SVM with SPSS in computational efficiency, and that the hybrid of DL and SL with SPSS outperforms SVM with SPSS in estimation quality.
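SPSS is described above as a subsampling method based on energy distance. As a rough illustration of that criterion only, and not of the support points optimization itself, the following sketch computes the empirical energy distance between a candidate subsample and the full sample. A subsample that represents the full data well should score lower. The function names and the toy data are assumptions made for this example.

```python
import math
from itertools import product


def mean_pairwise_distance(a, b):
    """Average Euclidean distance over all pairs (x, y), x in a and y in b."""
    total = sum(math.dist(x, y) for x, y in product(a, b))
    return total / (len(a) * len(b))


def energy_distance(a, b):
    """Empirical energy distance: 2 E|X - Y| - E|X - X'| - E|Y - Y'|."""
    return (
        2.0 * mean_pairwise_distance(a, b)
        - mean_pairwise_distance(a, a)
        - mean_pairwise_distance(b, b)
    )


# Toy "full" sample: ten equally spaced one-dimensional points (hypothetical data).
full = [(float(i),) for i in range(10)]

# Two candidate subsamples of size three.
spread = [(1.5,), (4.5,), (7.5,)]     # covers the whole range
clustered = [(0.0,), (1.0,), (2.0,)]  # covers only the left edge

e_spread = energy_distance(full, spread)
e_clustered = energy_distance(full, clustered)

# The well-spread subsample is closer to the full sample in energy distance.
print(e_spread < e_clustered)
```

Support points go further than scoring a given subsample: they are chosen to minimize this kind of distance to the data distribution, which this sketch does not attempt.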
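To make the role of sample splitting in DML concrete, here is a minimal sketch of a cross-fitted estimator, assuming a partially linear model y = theta * d + g(x) + e. Everything in it is illustrative: a simple least-squares line stands in for the SVM, DL, and SDL learners, the data are simulated, and the split is the traditional random two-fold split. The `folds` argument of the hypothetical `dml_plr` function is where a split produced by SPSS would be passed instead.

```python
import random


def fit_line(xs, ys):
    """Least-squares fit of y = a + b * x; returns a prediction function."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = my - b * mx
    return lambda x: a + b * x


def dml_plr(y, d, x, folds):
    """Cross-fitted partialling-out estimate of theta.

    `folds` is a list of index lists that partition range(len(y)).
    Nuisance functions are fitted off-fold and evaluated on the held-out fold.
    """
    n = len(y)
    num = 0.0
    den = 0.0
    for test_idx in folds:
        held_out = set(test_idx)
        train_idx = [i for i in range(n) if i not in held_out]
        # Nuisance estimates E[d | x] and E[y | x], fitted on the training part only.
        m_hat = fit_line([x[i] for i in train_idx], [d[i] for i in train_idx])
        l_hat = fit_line([x[i] for i in train_idx], [y[i] for i in train_idx])
        for i in test_idx:
            v = d[i] - m_hat(x[i])  # treatment residual
            u = y[i] - l_hat(x[i])  # outcome residual
            num += v * u
            den += v * v
    return num / den


# Simulated data with a known effect (assumed values, not the 401(k) data).
rng = random.Random(0)
n, theta_true = 2000, 0.5
x = [rng.gauss(0, 1) for _ in range(n)]
d = [0.8 * xi + rng.gauss(0, 1) for xi in x]
y = [theta_true * di + 1.5 * xi + rng.gauss(0, 1) for di, xi in zip(d, x)]

# Traditional random two-fold split; an SPSS-based split would replace this.
idx = list(range(n))
rng.shuffle(idx)
folds = [idx[: n // 2], idx[n // 2:]]

theta_hat = dml_plr(y, d, x, folds)
print(theta_hat)
```

Keeping the split as an explicit argument separates the question of how the folds are formed from the estimator itself, so random and support-point splits can be compared with the same learner.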