Semi-Parametric Identification and Estimation of Interaction and Effect Modification in Mixed Exposures using Stochastic Interventions

In many fields, including environmental epidemiology, researchers seek to understand the joint impact of a mixture of exposures. This requires analyzing a vector of exposures rather than a single exposure, and the exposure sets that matter most are unknown in advance. Examining every possible interaction or effect modification in a high-dimensional vector of candidates can be challenging or even impossible. To address this challenge, we propose a method for the automatic identification and estimation of three kinds of variable sets: exposure sets in a mixture with explanatory power, baseline covariates that modify the impact of an exposure, and sets of exposures that have synergistic, non-additive relationships. We define these parameters in a realistic nonparametric statistical model and use machine learning methods both to identify variable sets and to estimate the nuisance parameters of our target parameters, in order to avoid model misspecification. We prespecify a target parameter that is applied to variable sets once they are identified, and we use cross-validation to train efficient estimators of that parameter based on targeted maximum likelihood estimation. Our approach applies a shift intervention targeting individual variable importance, interaction, and effect modification based on the data-adaptively determined sets of variables. Our methodology is implemented in the open-source SuperNOVA package in R. We demonstrate the utility of our method through simulations, showing that our estimator is efficient and asymptotically linear under conditions requiring fast convergence of certain regression functions. We apply our method to the National Institute of Environmental Health Sciences mixtures workshop data, where it correctly identifies the antagonistic and agonistic interactions built into the data. Additionally, we investigate the association between exposure to persistent organic pollutants and longer leukocyte telomere length.
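
To make the shift-intervention idea concrete, the following is a minimal sketch and not the SuperNOVA implementation. It assumes two continuous exposures, one baseline covariate, and a correctly specified linear outcome regression, and it uses a simple plug-in (g-computation) estimator in place of targeted maximum likelihood estimation. The interaction contrast used here (the joint shift effect minus the sum of the two individual shift effects) is an assumed reading of the interaction parameter; the function names, simulated data, and shift sizes are all hypothetical.

```python
import numpy as np


def shift_effects(y, a1, a2, w, delta1=1.0, delta2=1.0):
    """Plug-in estimates of individual, joint, and interaction shift effects.

    Assumption: the outcome regression is linear in a1, a2, a1*a2, and w.
    This replaces the machine learning and TMLE steps with ordinary least squares.
    """
    def design(x1, x2):
        # Columns: intercept, exposure 1, exposure 2, their product, covariate.
        return np.column_stack([np.ones_like(x1), x1, x2, x1 * x2, w])

    # Fit the outcome regression E[Y | A1, A2, W] by least squares.
    beta, *_ = np.linalg.lstsq(design(a1, a2), y, rcond=None)

    def mean_outcome(x1, x2):
        # Average predicted outcome when exposures are set to (x1, x2).
        return float(np.mean(design(x1, x2) @ beta))

    baseline = mean_outcome(a1, a2)
    effect_a1 = mean_outcome(a1 + delta1, a2) - baseline
    effect_a2 = mean_outcome(a1, a2 + delta2) - baseline
    joint = mean_outcome(a1 + delta1, a2 + delta2) - baseline
    return {
        "a1": effect_a1,
        "a2": effect_a2,
        "joint": joint,
        # Non-additive part: what the joint shift adds beyond the two single shifts.
        "interaction": joint - effect_a1 - effect_a2,
    }


def simulate(n=5000, seed=0):
    """Hypothetical data with a built-in synergistic interaction of 0.4."""
    rng = np.random.default_rng(seed)
    w = rng.normal(size=n)
    a1 = 0.5 * w + rng.normal(size=n)
    a2 = -0.3 * w + rng.normal(size=n)
    y = 1 + 0.5 * a1 + 0.3 * a2 + 0.4 * a1 * a2 + 0.2 * w + rng.normal(size=n)
    return y, a1, a2, w


y, a1, a2, w = simulate()
result = shift_effects(y, a1, a2, w)
print(result)
```

In this simulated setting, shifting each exposure by one unit has true individual effects of 0.5 and 0.3, a true joint effect of 1.2, and therefore a true interaction contrast of 0.4. The sketch recovers these only because the regression model matches the data-generating process; the method described above instead relies on flexible learners and targeted maximum likelihood estimation to avoid that assumption.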
