Interpreting RNA-sequencing data requires identifying coordinated gene expression patterns that correspond to biological pathways. Standard factor models provide useful dimension reduction, but they typically either ignore existing pathway knowledge or incorporate it through restrictive assumptions, which limits interpretability and reproducibility. Here, we develop Bayesian Analysis with gene-Sets Informed Latent space (BASIL), a scalable framework for analyzing transcriptomic data that integrates annotated gene sets into latent variable inference. BASIL places structured priors on the factor loadings that shrink them toward combinations of annotated gene sets, which improves biological interpretability and stability, while simultaneously learning new unstructured components. By exploiting a pre-training approach that pre-estimates the latent factors, BASIL provides accurate covariance estimates and uncertainty quantification without resorting to computationally expensive Markov chain Monte Carlo sampling. An automatic empirical Bayes procedure eliminates the need for manual hyperparameter tuning, promoting reproducibility and usability in practice. Applying BASIL to the global fever transcriptomic cohort uncovers interpretable host-response modules, with phosphoinositide signaling and interferon-driven inflammation emerging as key drivers of gene-expression variability.
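The shrinkage idea described above can be illustrated with a minimal sketch. This is not BASIL's actual algorithm: it assumes a plain Gaussian factor model, a binary gene-set membership matrix `A`, fixed gene-set coefficients `Gamma`, factors `H` that are treated as already estimated, and fixed variances `sigma2` and `tau2` (no empirical Bayes step). All names, including `shrink_loadings`, are hypothetical. Under these assumptions the posterior mean of the loadings has a closed form that pulls each gene's loadings toward the gene-set combination `A @ Gamma`.

```python
import numpy as np


def shrink_loadings(Y, H, A, Gamma, sigma2=1.0, tau2=1.0):
    """Posterior mean of factor loadings under a gene-set-centred prior.

    Assumed model (illustrative only, not taken from the paper):
        Y[:, j] = H @ lam_j + noise,   noise ~ N(0, sigma2 * I)
        lam_j   ~ N((A @ Gamma)[j], tau2 * I)

    Y: (n, p) expression matrix, H: (n, k) pre-estimated factors,
    A: (p, s) gene-set membership, Gamma: (s, k) gene-set coefficients.
    Returns a (p, k) matrix of shrunken loadings.
    """
    k = H.shape[1]
    prior_mean = A @ Gamma                            # (p, k)
    # The posterior precision is the same for every gene.
    precision = H.T @ H / sigma2 + np.eye(k) / tau2   # (k, k)
    # Right-hand side combines the data term and the prior term.
    rhs = H.T @ Y / sigma2 + prior_mean.T / tau2      # (k, p)
    return np.linalg.solve(precision, rhs).T          # (p, k)


# Toy data: 200 samples, 30 genes, 3 factors, 5 gene sets.
rng = np.random.default_rng(0)
n, p, k, s = 200, 30, 3, 5
A = (rng.random((p, s)) < 0.3).astype(float)
Gamma = rng.normal(size=(s, k))
Lambda_true = A @ Gamma + 0.1 * rng.normal(size=(p, k))
H = rng.normal(size=(n, k))
Y = H @ Lambda_true.T + rng.normal(size=(n, p))

Lambda_hat = shrink_loadings(Y, H, A, Gamma, sigma2=1.0, tau2=0.01)
```

In this sketch a small `tau2` forces the loadings onto the gene-set structure, while a large `tau2` recovers ordinary least squares; the abstract's empirical Bayes procedure would presumably choose such hyperparameters from the data rather than fixing them by hand.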