Consider the problem of determining the effect of a compound on a specific cell type. To answer this question, researchers traditionally need to run an experiment applying the compound of interest to that cell type. This approach does not scale: given a large number of different actions (compounds) and a large number of different contexts (cell types), it is infeasible to run an experiment for every action-context pair. In such cases, one would ideally like to predict the outcome for every pair while performing experiments on only a small subset of pairs. This task, which we call "causal imputation", is a generalization of the causal transportability problem. To address this challenge, we extend the recently introduced synthetic interventions (SI) estimator to handle more general data sparsity patterns. We prove that, under a latent factor model, our estimator provides valid estimates for the causal imputation task. We motivate this model by establishing a connection to the linear structural causal model literature. Finally, we apply our estimator to the prominent CMAP dataset to predict the effects of compounds on gene expression across cell types. We find that our estimator outperforms standard baselines, supporting its utility in biological applications.
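To make the causal imputation task concrete, the following is a minimal sketch of an SI-style imputation for a single missing action-context pair. It is an illustration under simplifying assumptions, not the estimator analyzed in the paper: the function name `impute_pair`, the array layout, and the `rcond` singular-value cutoff are all hypothetical, and the sketch only uses training actions that were observed for the target context and for every donor context, which is a more restrictive sparsity pattern than the general ones the paper addresses.

```python
import numpy as np

def impute_pair(Y, observed, context, action, rcond=1e-8):
    """Estimate Y[context, action, :] from other observed pairs (illustrative sketch).

    Y        : float array, shape (n_contexts, n_actions, n_features)
    observed : bool array, shape (n_contexts, n_actions); True where Y was measured
    """
    # Donor contexts: other contexts in which the target action was measured.
    donors = [c for c in range(Y.shape[0]) if c != context and observed[c, action]]
    # Training actions: measured for the target context and for every donor.
    train = [a for a in range(Y.shape[1])
             if a != action and observed[context, a]
             and all(observed[d, a] for d in donors)]
    if not donors or not train:
        raise ValueError("no usable donor contexts or training actions")

    # One column per donor, one row per (training action, feature) combination.
    X = Y[donors][:, train, :].reshape(len(donors), -1).T
    y = Y[context, train, :].reshape(-1)

    # Minimum-norm least squares; singular values below rcond * largest are dropped.
    beta = np.linalg.lstsq(X, y, rcond=rcond)[0]

    # Apply the learned donor weights to the donors' outcomes under the target action.
    return beta @ Y[donors, action, :]
```

Under an exact low-rank latent factor model with no noise, the weights that reproduce the target context from the donors on the training actions also reproduce it on the held-out action, which is the intuition this sketch is meant to convey. With noisy data, the cutoff `rcond` would play the role of a rank-selection hyperparameter and would need to be tuned.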