Because hypermethylation of promoter cytosine-guanine dinucleotide (CpG) islands has been shown to silence tumour suppressor genes, identifying CpG sites that are differentially methylated between samples can assist in understanding disease. Differentially methylated CpG sites (DMCs) can be identified using moderated t-tests or nonparametric tests, but this requires transforming the data, because appropriate statistical methods that adequately account for the bounded nature of DNA methylation data are lacking. We propose a family of beta mixture models (BMMs) that take a model-based approach to clustering CpG sites using their original beta-valued methylation data, with no need for transformations. The BMMs allow (i) objective inference of methylation state thresholds and (ii) identification of DMCs between different sample types. The models in the family impose different parameter constraints, which facilitates their application to different study settings. Parameter estimation proceeds via an expectation-maximisation algorithm, with a novel approximation in the maximisation step providing tractability and computational feasibility. The performance of the BMMs is assessed through thorough simulation studies, and the BMMs are used to analyse a prostate cancer (PCa) dataset. The BMMs objectively infer intuitive and biologically interpretable methylation state thresholds, and identify DMCs that are related to genes implicated in carcinogenesis and involved in cancer-related pathways. The R package betaclust facilitates widespread use of BMMs.
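To make the estimation idea concrete, the following is a minimal sketch of expectation-maximisation for a univariate beta mixture, written in plain Python. It is not the betaclust implementation and does not reproduce the approximation proposed here: as a stated assumption, the maximisation step uses weighted method-of-moments updates as a simple closed-form stand-in. The function names (`beta_logpdf`, `fit_bmm`), the starting values, and the simulated three-state data are all hypothetical and chosen only for illustration.

```python
import math
import random


def beta_logpdf(x, a, b):
    """Log density of a Beta(a, b) distribution at x, with 0 < x < 1."""
    return (math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
            + (a - 1) * math.log(x) + (b - 1) * math.log(1 - x))


def fit_bmm(x, init_params, n_iter=200, eps=1e-6):
    """EM for a K-component beta mixture (illustrative sketch only).

    The M-step uses weighted method-of-moments updates, which is an
    assumption made for this sketch, not the approximation in the paper.
    """
    x = [min(max(v, eps), 1 - eps) for v in x]  # keep values off 0 and 1
    n, K = len(x), len(init_params)
    params = list(init_params)
    weights = [1.0 / K] * K
    resp = []
    for _ in range(n_iter):
        # E-step: posterior probability of each component for each site.
        resp = []
        for v in x:
            logs = [math.log(weights[k]) + beta_logpdf(v, *params[k])
                    for k in range(K)]
            m = max(logs)
            p = [math.exp(l - m) for l in logs]
            s = sum(p)
            resp.append([pk / s for pk in p])
        # M-step: match each component's weighted mean and variance.
        for k in range(K):
            nk = max(sum(r[k] for r in resp), 1e-12)
            mean = sum(r[k] * v for r, v in zip(resp, x)) / nk
            mean = min(max(mean, eps), 1 - eps)
            var = sum(r[k] * (v - mean) ** 2 for r, v in zip(resp, x)) / nk
            var = min(max(var, 1e-8), 0.999 * mean * (1 - mean))
            common = mean * (1 - mean) / var - 1
            params[k] = (mean * common, (1 - mean) * common)
            weights[k] = max(nk / n, 1e-12)
    return weights, params, resp


# Simulated beta values for three hypothetical methylation states.
random.seed(1)
data = ([random.betavariate(2, 20) for _ in range(300)]      # low
        + [random.betavariate(10, 10) for _ in range(300)]   # intermediate
        + [random.betavariate(20, 2) for _ in range(300)])   # high

weights, params, resp = fit_bmm(data, [(1, 5), (5, 5), (5, 1)])
means = [a / (a + b) for a, b in params]
# Hard cluster label for each site: the component with highest posterior.
labels = [max(range(len(r)), key=r.__getitem__) for r in resp]
print("mixing weights:", [round(w, 3) for w in weights])
print("component means:", [round(m, 3) for m in means])
```

In a sketch like this, a methylation state threshold could be read off as the beta value at which the most probable component changes. The moment-matching update is used only because it has a closed form; maximising the weighted beta log-likelihood directly has no closed-form solution, which is the difficulty that motivates an approximation in the maximisation step.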