Cross-Validated Decision Trees with Targeted Maximum Likelihood Estimation for Nonparametric Causal Mixtures Analysis

Exposure to mixtures of chemicals, such as drugs, pollutants, and nutrients, is common in real-world exposure and treatment settings. To understand how these mixtures affect health outcomes, an interpretable and important approach is to estimate the causal effect of the exposure regions most strongly associated with a health outcome. This requires a statistical estimator that can both identify these exposure regions and provide an unbiased estimate of a causal target parameter given the region. In this work, we present a methodology that uses decision trees to determine exposure regions data-adaptively and employs cross-validated targeted maximum likelihood estimation to obtain an unbiased estimate of the average regional-exposure effect (ARE). The result is a plug-in estimator that is asymptotically normal with minimum variance, from which confidence intervals can be derived. The methodology is implemented in CVtreeMLE, an open-source R package. Analysts supply a vector of exposures, covariates, and an outcome; the package returns tables of exposure regions, such as lead > 2.1 & arsenic > 1.4, each with an associated ARE, which represents the difference in mean outcome if all individuals were exposed to the region compared with if none were. CVtreeMLE enables researchers to discover interpretable exposure regions in mixed-exposure settings and provides robust statistical inference for the effects of these regions. The resulting quantities offer interpretable thresholds that can inform public health policy, such as pollutant regulations, or aid medical decision-making, such as identifying the most effective drug combinations.
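
To make the meaning of the ARE concrete for one fixed region, a minimal Python sketch follows. It is not CVtreeMLE (which is an R package) and it does not perform decision-tree region discovery, cross-validation, or the targeted maximum likelihood update. It assumes the region lead > 2.1 & arsenic > 1.4 is already given, assumes a single discrete covariate `w`, and uses simple stratified standardization in place of the estimator described above. The function names, column names, and toy data are hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from statistics import mean


def in_region(row, lead_cut=2.1, arsenic_cut=1.4):
    """Indicator for the example region lead > 2.1 & arsenic > 1.4."""
    return int(row["lead"] > lead_cut and row["arsenic"] > arsenic_cut)


def plug_in_are(rows):
    """Standardized plug-in estimate of the ARE for a fixed, known region.

    Within each stratum of the discrete covariate `w`, take the difference in
    mean outcome between rows inside and outside the region, then average
    those differences weighted by each stratum's share of the sample.
    """
    strata = defaultdict(lambda: {0: [], 1: []})
    for row in rows:
        strata[row["w"]][in_region(row)].append(row["y"])

    n = len(rows)
    are = 0.0
    for groups in strata.values():
        # Every stratum must contain rows both inside and outside the region.
        if not groups[0] or not groups[1]:
            raise ValueError("each stratum needs rows inside and outside the region")
        weight = (len(groups[0]) + len(groups[1])) / n
        are += weight * (mean(groups[1]) - mean(groups[0]))
    return are


# Hypothetical toy data; the values are made up for illustration only.
rows = [
    {"lead": 3.0, "arsenic": 2.0, "w": 0, "y": 5.0},
    {"lead": 2.5, "arsenic": 1.5, "w": 0, "y": 7.0},
    {"lead": 1.0, "arsenic": 2.0, "w": 0, "y": 2.0},
    {"lead": 3.0, "arsenic": 1.0, "w": 0, "y": 4.0},
    {"lead": 2.2, "arsenic": 1.6, "w": 1, "y": 10.0},
    {"lead": 0.5, "arsenic": 0.5, "w": 1, "y": 6.0},
]
are_hat = plug_in_are(rows)
```

On this toy data, the stratum with `w = 0` contributes a difference of 3 with weight 4/6, and the stratum with `w = 1` contributes a difference of 4 with weight 2/6, so `are_hat` is 10/3. In the methodology described above, the region itself is learned from the data and the estimate is targeted and cross-validated, which this sketch deliberately omits.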
