High-dimensional data often exhibit variation that can be captured by lower-dimensional factors. For high-dimensional data from multiple studies or environments, one goal is to understand which underlying factors are common to all studies and which are study- or environment-specific. As a particular example, we consider platelet gene expression data from patients in different disease groups. In these data, factors correspond to clusters of co-expressed genes; we may expect some clusters (or biological pathways) to be active for all diseases, while others are active only for a specific disease. To learn these factors, we consider a nonlinear multi-study factor model, which allows for both shared and study-specific factors. To fit this model, we propose a multi-study sparse variational autoencoder. The underlying model is sparse in that each observed feature (i.e., each dimension of the data) depends on only a small subset of the latent factors. In the genomics example, this means each gene is active in only a few biological processes. Further, the model implicitly induces a penalty on the number of latent factors, which helps separate the shared factors from the group-specific factors. We prove that the latent factors are identified, and demonstrate that our method recovers meaningful factors in the platelet gene expression data.
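As a rough illustration of the model structure described above, the following NumPy sketch simulates data from a toy multi-study factor model with shared factors, study-specific factors, and a sparse feature-to-factor mask. The function name `simulate_multi_study`, the tanh decoder, the mask density of 0.2, and the noise scale are all illustrative assumptions, not details taken from the paper; the proposed method fits a variational autoencoder rather than simulating from a known decoder.

```python
import numpy as np


def simulate_multi_study(n_per_study=50, n_features=20, n_shared=2,
                         n_specific=1, n_studies=3, seed=0):
    """Simulate a toy multi-study factor model (illustrative assumptions only)."""
    rng = np.random.default_rng(seed)
    n_factors = n_shared + n_studies * n_specific

    # Sparse mask: each feature (row) depends on only a few factors (columns).
    mask = rng.random((n_features, n_factors)) < 0.2
    for j in range(n_features):
        if not mask[j].any():
            # Make sure every feature depends on at least one factor.
            mask[j, rng.integers(n_factors)] = True
    weights = rng.normal(size=(n_features, n_factors)) * mask

    X, Z, labels = [], [], []
    for s in range(n_studies):
        z = np.zeros((n_per_study, n_factors))
        # Shared factors are active in every study.
        z[:, :n_shared] = rng.normal(size=(n_per_study, n_shared))
        # Study-specific factors are active only in study s.
        lo = n_shared + s * n_specific
        z[:, lo:lo + n_specific] = rng.normal(size=(n_per_study, n_specific))
        # Assumed nonlinear decoder: tanh of the masked linear map, plus noise.
        x = np.tanh(z @ weights.T) + 0.1 * rng.normal(size=(n_per_study, n_features))
        X.append(x)
        Z.append(z)
        labels.append(np.full(n_per_study, s))

    return np.vstack(X), np.vstack(Z), np.concatenate(labels), mask


if __name__ == "__main__":
    X, Z, labels, mask = simulate_multi_study()
    print(X.shape, Z.shape, mask.sum(axis=1))
```

With the default arguments, `X` has 150 rows (50 per study) and 20 columns, and the rows of `Z` belonging to one study are exactly zero in the columns reserved for the other studies' specific factors. That zero pattern is the shared-versus-specific structure the model aims to recover.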