We consider the problem of modeling the effects of perturbations, such as gene knockdowns or drugs, on measurements, such as single-cell RNA or protein counts. Given data for some perturbations, we aim to predict the distribution of measurements for new combinations of perturbations. To address this challenging extrapolation task, we posit that perturbations act additively in a suitable, unknown embedding space. We formulate the data-generating process as a latent variable model, in which perturbations amount to mean shifts in latent space and can be combined additively. We then prove that, given sufficiently diverse training perturbations, the representation and perturbation effects are identifiable up to orthogonal transformation and use this to characterize the class of unseen perturbations for which we obtain extrapolation guarantees. We establish a link between our model class and shift interventions in linear latent causal models. To estimate the model from data, we propose a new method, the perturbation distribution autoencoder (PDAE), which is trained by maximizing the distributional similarity between true and simulated perturbation distributions. The trained model can then be used to predict previously unseen perturbation distributions. Through simulations, we demonstrate that PDAE can accurately predict the effects of unseen but identifiable perturbations, supporting our theoretical results.
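The additive mean-shift assumption can be illustrated with a minimal sketch. It is not the PDAE implementation: the Gaussian base distribution, the `tanh` decoder, the function names `sample_perturbed` and `energy_distance`, and the choice of the energy distance as the measure of distributional similarity are all assumptions made for illustration. Maximizing similarity then corresponds to minimizing this distance between true and simulated samples.

```python
import numpy as np


def sample_perturbed(labels, shift_matrix, decoder, n_samples, rng):
    """Draw simulated measurements under the perturbation encoded by `labels`.

    labels:       (K,) vector; entry k is the indicator or dose of elementary
                  perturbation k, so a combination has several non-zero entries.
    shift_matrix: (d, K) matrix; column k is the latent mean shift caused by
                  elementary perturbation k.
    decoder:      callable mapping latents of shape (n, d) to measurements.
    """
    latent_dim = shift_matrix.shape[0]
    # Unperturbed latent variables (assumed standard Gaussian for this sketch).
    z_base = rng.standard_normal((n_samples, latent_dim))
    # Perturbations act as additive mean shifts in latent space.
    z = z_base + shift_matrix @ labels
    return decoder(z)


def energy_distance(x, y):
    """Sample estimate of the energy distance between two samples (rows = cells)."""
    def mean_dist(a, b):
        # Mean Euclidean distance over all pairs of rows from a and b.
        return np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1).mean()

    return 2.0 * mean_dist(x, y) - mean_dist(x, x) - mean_dist(y, y)


# Hypothetical setup: 3 latent dimensions, 2 elementary perturbations.
rng = np.random.default_rng(0)
W = np.array([[1.0, 0.0], [0.0, 2.0], [0.5, -1.0]])   # latent shifts (assumed)
A = rng.standard_normal((3, 5))                        # mixing weights (assumed)
decoder = lambda z: np.tanh(z @ A)                     # toy nonlinear decoder

control = sample_perturbed(np.array([0.0, 0.0]), W, decoder, 500, rng)
combo = sample_perturbed(np.array([1.0, 1.0]), W, decoder, 500, rng)
print("energy distance, control vs. combination:", energy_distance(control, combo))
```

In this sketch the combination `[1, 1]` shifts the latent mean by the sum of the two columns of `W`, which is what allows a model fitted on some perturbations to be queried on a combination it has not seen.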