Mediation with External Summary Statistic Information (MESSI)

Environmental health studies are increasingly measuring endogenous omics data ($\boldsymbol{M}$) to study intermediary biological pathways by which an exogenous exposure ($\boldsymbol{A}$) affects a health outcome ($\boldsymbol{Y}$), given confounders ($\boldsymbol{C}$). Mediation analysis is frequently carried out to understand such mechanisms. If intermediary pathways are of interest, then there is likely literature establishing the statistical and biological significance of the total effect, defined as the effect of $\boldsymbol{A}$ on $\boldsymbol{Y}$ given $\boldsymbol{C}$. For mediation models with continuous outcomes and mediators, we show that leveraging external summary-level information on the total effect improves the estimation efficiency of the natural direct and indirect effects. Moreover, the efficiency gain depends on the asymptotic partial $R^2$ between the outcome ($\boldsymbol{Y}\mid\boldsymbol{M},\boldsymbol{A},\boldsymbol{C}$) and total effect ($\boldsymbol{Y}\mid\boldsymbol{A},\boldsymbol{C}$) models, with smaller (larger) values benefiting direct (indirect) effect estimation. We robustify our estimation procedure to incongenial external information by treating the total effect as random rather than fixed. This framework allows shrinkage towards the external information if the total effects in the internal and external populations agree. We illustrate our methodology using data from the Puerto Rico Testsite for Exploring Contamination Threats, where cytochrome P450 metabolites are hypothesized to mediate the effect of phthalate exposure on gestational age at delivery. External information on the total effect comes from a recently published pooled analysis of 16 studies. The proposed framework blends mediation analysis with emerging data integration techniques.

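To make the quantities in the abstract concrete, the following is a minimal sketch of the standard product-of-coefficients decomposition for a single continuous mediator under linear models. It is not the MESSI estimator itself and does not use any external information; it only shows how the total effect in the $\boldsymbol{Y}\mid\boldsymbol{A},\boldsymbol{C}$ model splits into a natural direct effect and a natural indirect effect estimated from the $\boldsymbol{Y}\mid\boldsymbol{M},\boldsymbol{A},\boldsymbol{C}$ and $\boldsymbol{M}\mid\boldsymbol{A},\boldsymbol{C}$ models. The function names, the simulated data, and the true coefficient values are all assumptions made for illustration.

```python
import numpy as np


def ols(X, y):
    """Least-squares coefficients for y ~ X (X already contains an intercept)."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef


def decompose_effects(A, M, C, Y):
    """Direct, indirect, and total effect estimates under linear models.

    Assumes one continuous mediator M, one exposure A, one confounder C,
    and no exposure-mediator interaction (illustrative assumptions).
    """
    ones = np.ones(len(Y))
    X_ac = np.column_stack([ones, A, C])       # design for M | A, C and Y | A, C
    X_mac = np.column_stack([ones, M, A, C])   # design for Y | M, A, C

    alpha_a = ols(X_ac, M)[1]                  # effect of A on M
    coef_out = ols(X_mac, Y)
    beta_m, beta_a = coef_out[1], coef_out[2]  # effects of M and A on Y
    total = ols(X_ac, Y)[1]                    # total effect of A on Y

    return {
        "nde": float(beta_a),                  # natural direct effect
        "nie": float(alpha_a * beta_m),        # natural indirect effect
        "te": float(total),                    # total effect
    }


# Simulated data with hypothetical true values: NDE = 0.2, NIE = 0.8 * 0.4 = 0.32.
rng = np.random.default_rng(0)
n = 2000
C = rng.normal(size=n)
A = 0.5 * C + rng.normal(size=n)
M = 0.8 * A + 0.3 * C + rng.normal(size=n)
Y = 0.4 * M + 0.2 * A + 0.1 * C + rng.normal(size=n)

effects = decompose_effects(A, M, C, Y)
print(effects)
```

With ordinary least squares and the same confounders in every model, the estimated total effect equals the sum of the estimated direct and indirect effects exactly. This is why outside knowledge about the total effect carries information about both components.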