Molecular signatures derived from omics data are increasingly used in epidemiological studies to characterize lifestyle exposures, either as proxies for exposure or to provide insight into disease mechanisms. These signatures are typically constructed by regressing the exposure on high-dimensional omics features. In the literature, an initial univariate screening step has sometimes been applied prior to multivariate modelling, but the causal implications of this choice have not yet been considered. Focusing on settings where the exposure causally influences molecular features (and not the reverse), we use directed acyclic graphs (DAGs) and $d$-separation arguments to show that collider bias may arise when the screening step is omitted, leading to the inclusion of non-causal features in the signature. We further demonstrate that the screening step can mitigate this bias. Our simulation studies illustrate that screening reduces the inclusion of non-causal features, albeit at the cost of lower sensitivity and reduced correlation between the exposure and the resulting signature. Overall, we recommend applying univariate screening prior to signature construction, particularly when the inclusion of non-causal features is undesirable, such as in mechanistic studies.
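The collider argument can be illustrated with a minimal simulation. This is a sketch under assumed conditions, not the authors' simulation design: the three-variable DAG, the unit variances, the sample size, and the screening threshold of 0.05 are all hypothetical choices made for illustration. The exposure `x` causes feature `m1`; an unmeasured factor `u`, independent of `x`, causes both `m1` and a second feature `m2`. Because `m1` is a collider on the path `x -> m1 <- u -> m2`, regressing `x` on both features jointly conditions on `m1` and opens that path, so `m2` can receive a nonzero coefficient even though `x` has no causal effect on it.

```python
import numpy as np


def simulate(n=20000, seed=0):
    """Draw data from an assumed toy DAG: x -> m1 <- u -> m2."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=n)            # exposure
    u = rng.normal(size=n)            # unmeasured factor, independent of x
    m1 = x + u + rng.normal(size=n)   # causal feature (collider of x and u)
    m2 = u + rng.normal(size=n)       # non-causal feature (not affected by x)
    return x, np.column_stack([m1, m2])


def ols_coefs(y, features):
    """Least-squares slopes of y on the feature columns (intercept dropped)."""
    design = np.column_stack([np.ones(len(y)), features])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    return beta[1:]


def marginal_corr(y, features):
    """Pearson correlation of y with each feature column, one at a time."""
    return np.array(
        [np.corrcoef(y, features[:, j])[0, 1] for j in range(features.shape[1])]
    )


x, M = simulate()

# Without screening: regress the exposure on both features jointly.
joint = ols_coefs(x, M)

# With screening: keep only features marginally associated with the exposure.
# The 0.05 threshold is an arbitrary illustrative choice.
marg = marginal_corr(x, M)
keep = np.abs(marg) > 0.05
screened = ols_coefs(x, M[:, keep])

print("joint coefficients (m1, m2):", joint)
print("marginal correlations (m1, m2):", marg)
print("features kept after screening:", keep)
print("coefficients after screening:", screened)
```

Under these assumed unit variances, the population coefficients of the joint regression are 0.4 for `m1` and -0.2 for `m2`, while the marginal correlation between `x` and `m2` is zero. The joint model therefore assigns weight to the non-causal feature, whereas the univariate screen drops it before the multivariate step.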