Prostate cancer is one of the most common cancers in men. In its advanced stages, tumours can metastasise to other tissues in the body, which is usually fatal. In this thesis, we perform a genetic analysis of prostate cancer tumours from different metastatic sites using data science, machine learning and topological network analysis methods. We present a general procedure for pre-processing gene expression datasets and pre-filtering significant genes by analytical methods. We then use machine learning models to filter key genes further and to classify tumours by secondary site. Finally, we perform gene co-expression network analysis and community detection on samples from different types of prostate cancer secondary site. Of the 14,379 genes considered, 13 were selected as the genes most strongly related to metastatic prostate cancer, and classification with them achieved approximately 92% accuracy under cross-validation. In addition, we provide preliminary insights into the co-expression patterns of genes in gene co-expression networks. Project code is available at https://github.com/zcablii/Master_cancer_project.