Single-cell RNA-seq profiles are high-dimensional, sparse, and unordered, so autoregressive generation imposes an artificial ordering bias and suffers from error accumulation. To address this, we propose scDiVa, a masked discrete diffusion foundation model that defines a continuous-time forward masking mechanism in token space, aligning generation with the dropout-like corruption process. scDiVa features a bidirectional denoiser that jointly models discrete gene identities and continuous expression values, using entropy-normalized serialization to maximize information efficiency and a latent anchor token to preserve global cell identity. The model is trained with depth-invariant time sampling and a dual denoising objective, which simulate varying sparsity levels while ensuring precise recovery of both gene identity and expression magnitude. Pre-trained on 59 million cells, scDiVa achieves strong transfer performance across major benchmarks, including batch integration, cell type annotation, and perturbation response prediction. These results suggest that masked discrete diffusion is a biologically coherent and effective alternative to autoregression.
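To make the forward masking idea concrete, the following is a minimal sketch of masking a tokenized cell at a continuous time t in [0, 1]. It is not scDiVa's implementation: the linear schedule (each token is masked independently with probability t), the `MASK_ID` sentinel, the zeroing of masked values, and the function name `forward_mask` are all assumptions made for illustration.

```python
import random

MASK_ID = -1  # hypothetical sentinel for a masked gene token (assumption)


def forward_mask(gene_ids, values, t, rng):
    """Mask each (gene id, value) token independently with probability t.

    Assumes a linear schedule: t = 0 leaves the cell intact and
    t = 1 masks every token. The abstract does not state the schedule.
    """
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must lie in [0, 1]")
    noisy_ids, noisy_values, masked = [], [], []
    for gene_id, value in zip(gene_ids, values):
        hide = rng.random() < t  # rng.random() is uniform on [0, 1)
        masked.append(hide)
        noisy_ids.append(MASK_ID if hide else gene_id)
        noisy_values.append(0.0 if hide else value)  # placeholder for hidden values
    return noisy_ids, noisy_values, masked


# Toy cell: five expressed genes with illustrative (made-up) expression values.
gene_ids = [12, 7, 301, 45, 88]
values = [2.3, 0.7, 5.1, 1.0, 3.4]

rng = random.Random(0)
noisy_ids, noisy_values, masked = forward_mask(gene_ids, values, t=0.5, rng=rng)
```

Under this sketch, a dual denoising objective would score a denoiser only on the positions where `masked` is true, for example with a classification loss on the hidden gene identities and a regression loss on the hidden values. Because masking is applied per token rather than left to right, no gene ordering is imposed.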