Generative modeling of single-cell RNA-seq data is crucial for tasks like trajectory inference, batch effect removal, and simulation of realistic cellular data. However, recent deep generative models simulating synthetic single cells from noise operate on pre-processed continuous gene expression approximations, overlooking the discrete nature of single-cell data, which limits their effectiveness and hinders the incorporation of robust noise models. Additionally, aspects like controllable multi-modal and multi-label generation of cellular data remain underexplored. This work introduces CellFlow for Generation (CFGen), a flow-based conditional generative model that preserves the inherent discreteness of single-cell data. CFGen generates whole-genome multi-modal single-cell data reliably, improving the recovery of crucial biological data characteristics while tackling relevant generative tasks such as rare cell type augmentation and batch correction. We also introduce a novel framework for compositional data generation using Flow Matching. By showcasing CFGen on a diverse set of biological datasets and settings, we provide evidence of its value to the fields of computational biology and deep generative models.