Mediation analysis is widely used to disentangle causal pathways, yet in many real-world studies the mediator $M$ and outcome $Y$ are never jointly observed. This incompleteness breaks the standard identification strategy for natural direct and indirect effects. We introduce a novel data fusion framework that restores identification by combining two incomplete data sources, one measuring $M$ and the other measuring $Y$. Our approach leverages shared instrumental variables (IVs) to circumvent the need to observe $(M, Y)$ jointly, remains valid under unmeasured confounding via a no-interaction condition, and accommodates covariate and exposure shifts across data sources under a latent alignment condition. We establish two identification strategies: one for settings with a known set of valid IVs, and another for settings where valid IVs must be learned. We further develop semiparametric, influence-function-based estimators with multiple robustness properties, and propose an estimator that attains the semiparametric efficiency bound under appropriate conditions. We apply our framework to quantify the extent to which the effect of SNP rs610932 on dementia risk is mediated through immune-related gene-expression pathways.
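To make concrete why missing joint observations of the mediator and outcome are a problem, the following sketch recalls the textbook mediation formula. The notation is an assumption for illustration and may differ from the paper's: $A$ is a binary exposure, $X$ are discrete covariates, $M$ is a discrete mediator, and $Y(a, m)$, $M(a)$ are potential outcomes. The formula holds under the usual sequential ignorability (no unmeasured confounding) assumptions, which the proposed framework is designed to relax.

$$
\begin{aligned}
\mathrm{NDE} &= E\big[Y(1, M(0))\big] - E\big[Y(0, M(0))\big], \\
\mathrm{NIE} &= E\big[Y(1, M(1))\big] - E\big[Y(1, M(0))\big], \\
E\big[Y(a, M(a'))\big] &= \sum_{x} \sum_{m} E\big[Y \mid A = a, M = m, X = x\big] \, P\big(M = m \mid A = a', X = x\big) \, P\big(X = x\big).
\end{aligned}
$$

The factors $P(M = m \mid A = a', X = x)$ and $P(X = x)$ can be estimated from a sample that records only the mediator, but the regression $E[Y \mid A = a, M = m, X = x]$ conditions on $M$ while averaging $Y$, so it requires units on which both are measured. When no such units exist, this term cannot be estimated directly, which is the gap a data fusion approach has to close.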