Inferring cellular heterogeneity with mixture models for DNA methylation rates

Cellular heterogeneity is a hallmark of biological tissues and plays a central role in disease progression, diagnosis, and prognosis. Yet accurately characterizing this heterogeneity from bulk molecular profiles remains challenging because observed signals arise from mixtures of multiple cell populations. Cell deconvolution aims to recover the relative abundance of constituent cell types from such heterogeneous measurements, but most existing approaches implicitly rely on restrictive assumptions about residual errors, including independence, homoscedasticity, and normality. These assumptions are rarely satisfied in omics data, which are inherently bounded and overdispersed. In this work, we show that whole-genome cell-type-specific DNA methylation profiles exhibit latent group structures that can substantially impair deconvolution accuracy when ignored. We therefore propose a mixture of non-negative Beta regression models for DNA methylation rates, estimated through an Expectation-Maximization algorithm. Our framework naturally incorporates a feature selection mechanism through mixture component identification, making component selection a critical step of the inference procedure. We further propose a dedicated criterion for component selection and assess the performance of the approach through an extensive comparative study across several in vitro benchmark datasets. Our results show that deconvolution accuracy is highly sensitive to latent component structure and that explicitly modeling this heterogeneity yields substantial improvements over standard whole-genome deconvolution strategies. Altogether, this work establishes mixture modeling of DNA methylation data as a powerful new direction for robust and accurate cell deconvolution.
