Data integration methods aim to extract low-dimensional embeddings from high-dimensional outcomes to remove unwanted variation, such as batch effects and unmeasured covariates, across heterogeneous datasets. However, multiple hypothesis testing after integration can be biased because the integration procedure is itself data-dependent. We introduce a robust post-integrated inference (PII) method that adjusts for latent heterogeneity using negative control outcomes. Leveraging causal interpretations, we derive nonparametric identifiability of the direct effects, which motivates our semiparametric inference method. Our method extends to projected direct effect estimands, accounting for hidden mediators, confounders, and moderators. These estimands remain statistically meaningful under model misspecification and with error-prone embeddings. We provide bias quantifications and finite-sample linear expansions with uniform concentration bounds. The proposed doubly robust estimators are consistent and efficient under minimal assumptions and potential misspecification, facilitating data-adaptive estimation with machine learning algorithms. We evaluate our proposal with random forests through simulations and an analysis of single-cell CRISPR-perturbed datasets with potential unmeasured confounders.