Genome-wide association studies (GWAS) identify genetic variants that occur more frequently in people with a disease, such as Parkinson's disease (PD), than in people without it. Feature selection and machine learning approaches can be applied to GWAS data to identify potential disease biomarkers. However, technical variation between GWAS, such as differences in genotyping platforms and in the criteria for selecting individuals to genotype, affects the reproducibility of the biomarkers identified. To address this issue, we collected five GWAS datasets from the database of Genotypes and Phenotypes (dbGaP) and explored several data integration strategies. We evaluated the agreement among strategies in terms of the single nucleotide polymorphisms (SNPs) that each identified as potential PD biomarkers. Concordance was low among the biomarkers discovered using different datasets or integration strategies. Nevertheless, fifty SNPs were identified at least twice and could potentially serve as novel PD biomarkers. These SNPs are indirectly linked to PD in the literature but have not previously been directly associated with it. These findings open new avenues of investigation.
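The agreement analysis described above can be illustrated with a minimal sketch. It assumes each dataset or integration strategy yields a set of selected SNP identifiers, and it uses the Jaccard index as the concordance measure. That choice is an assumption for illustration, since the abstract does not name the measure used. The function names, strategy names, and SNP identifiers below are all hypothetical placeholders, not data from the study.

```python
from collections import Counter
from itertools import combinations


def jaccard(a, b):
    """Jaccard index of two sets; defined here as 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def pairwise_concordance(selections):
    """Jaccard index for every pair of strategies, keyed by the name pair."""
    return {
        (x, y): jaccard(selections[x], selections[y])
        for x, y in combinations(sorted(selections), 2)
    }


def recurrent_snps(selections, min_count=2):
    """SNPs selected by at least `min_count` strategies, sorted by name."""
    counts = Counter(snp for snps in selections.values() for snp in snps)
    return sorted(snp for snp, n in counts.items() if n >= min_count)


# Placeholder strategy names and SNP identifiers (assumed, not real data).
selections = {
    "dataset_1": {"snp_a", "snp_b", "snp_c"},
    "dataset_2": {"snp_b", "snp_d"},
    "merged": {"snp_c", "snp_d", "snp_e"},
}

concordance = pairwise_concordance(selections)
recurrent = recurrent_snps(selections)
```

Here `concordance` maps each pair of strategies to its overlap score, and `recurrent` lists the SNPs selected at least twice, which mirrors how a shortlist of repeatedly identified candidates might be assembled.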