Identifying differentially methylated cytosine-guanine dinucleotide (CpG) sites between benign and tumour samples can assist in understanding disease. However, differential analysis of bounded DNA methylation data often requires data transformation, which reduces biological interpretability. To address this, a family of beta mixture models (BMMs) is proposed that (i) objectively infers methylation state thresholds and (ii) identifies differentially methylated CpG sites (DMCs) from untransformed, beta-valued methylation data. The BMMs achieve this through model-based clustering of CpG sites, and parameter constraints allow them to be applied in different study settings. Inference proceeds via an expectation-maximisation algorithm, with an approximate maximisation step providing tractability and computational feasibility. Performance of the BMMs is assessed through simulation studies, and the BMMs are used for differential analyses of DNA methylation data from a prostate cancer study. Intuitive and biologically interpretable methylation state thresholds are inferred and DMCs are identified, including those related to genes such as GSTP1, RASSF1 and RARB, which are known for their role in prostate cancer development. Gene ontology analysis of the DMCs revealed significant enrichment in cancer-related pathways, demonstrating the utility of BMMs for revealing biologically relevant insights. An R package, betaclust, facilitates widespread use of BMMs.
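To make the idea of clustering untransformed beta values concrete, the following is a minimal Python sketch of an expectation-maximisation loop for a univariate beta mixture. It is not the betaclust implementation and does not reproduce the parameter constraints described above. In particular, the M-step here uses weighted method-of-moments updates as one possible approximation, which is an assumption of this sketch; the approximation used by the authors may differ. The names `beta_logpdf` and `fit_beta_mixture` are hypothetical.

```python
import math


def beta_logpdf(x, a, b):
    """Log density of a Beta(a, b) distribution at x, for 0 < x < 1."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return log_norm + (a - 1.0) * math.log(x) + (b - 1.0) * math.log(1.0 - x)


def fit_beta_mixture(x, k=3, n_iter=100, eps=1e-6):
    """Fit a k-component beta mixture by EM with a moment-matching M-step.

    Assumption of this sketch: the shape parameters are updated by weighted
    method of moments rather than by exact likelihood maximisation.
    Returns (weights, params, resp), where params[j] = (a_j, b_j) and
    resp[i][j] is the posterior probability that x[i] belongs to component j.
    """
    # Keep values strictly inside (0, 1) so the log terms stay finite.
    x = [min(max(v, eps), 1.0 - eps) for v in x]
    n = len(x)

    # Initialise with hard assignments: split the sorted data into k groups.
    order = sorted(range(n), key=lambda i: x[i])
    resp = [[0.0] * k for _ in range(n)]
    for rank, i in enumerate(order):
        resp[i][min(rank * k // n, k - 1)] = 1.0

    weights, params = [], []
    for _ in range(n_iter):
        # Approximate M-step: match each component's weighted mean and variance.
        weights, params = [], []
        for j in range(k):
            nj = sum(r[j] for r in resp) + 1e-12  # guard against empty clusters
            mean = sum(r[j] * v for r, v in zip(resp, x)) / nj
            mean = min(max(mean, eps), 1.0 - eps)
            var = sum(r[j] * (v - mean) ** 2 for r, v in zip(resp, x)) / nj
            # A beta variance must lie below mean * (1 - mean).
            var = min(max(var, 1e-8), 0.999 * mean * (1.0 - mean))
            common = mean * (1.0 - mean) / var - 1.0
            params.append((mean * common, (1.0 - mean) * common))
            weights.append(nj / n)

        # E-step: posterior component probabilities, computed in log space.
        for i, v in enumerate(x):
            logs = [math.log(weights[j]) + beta_logpdf(v, *params[j])
                    for j in range(k)]
            top = max(logs)
            probs = [math.exp(val - top) for val in logs]
            total = sum(probs)
            resp[i] = [p / total for p in probs]

    return weights, params, resp
```

Under these assumptions, calling `fit_beta_mixture(values, k=3)` on a list of beta values returns mixing proportions, shape parameters and per-site cluster probabilities. One way to read off a methylation state threshold from such a fit would be the beta value at which the most probable component changes.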