Generative modeling of single-cell RNA-seq data has shown great potential in community-driven tasks such as trajectory inference, batch effect removal, and gene expression generation. However, most recent deep models that generate synthetic single cells from noise operate on pre-processed, continuous approximations of gene expression. They ignore the inherently discrete and over-dispersed nature of single-cell data, which limits downstream applications and hinders the incorporation of robust noise models. Moreover, crucial aspects of deep-learning-based synthetic single-cell generation remain underexplored, such as controllable multi-modal and multi-label generation and its role in improving the performance of downstream tasks. This work presents Cell Flow for Generation (CFGen), a flow-based conditional generative model for multi-modal single-cell counts that explicitly accounts for the discrete nature of the data. Our results suggest improved recovery of crucial biological data characteristics, while also supporting novel generative tasks such as conditioning on multiple attributes and boosting rare cell type classification via data augmentation. By showcasing CFGen on a diverse set of biological datasets and settings, we provide evidence of its value to the fields of computational biology and deep generative models.