In the ongoing effort to improve medical diagnostics, the integration of state-of-the-art machine learning methods has emerged as a promising research area. In molecular biology, multi-omics sequencing has produced an explosion of data. Modern sequencing equipment can provide a large number of complex measurements per experiment, so traditional statistical methods struggle with such high-dimensional data. However, most of the information contained in these datasets is redundant or irrelevant and can be reduced to significantly fewer variables without losing much information. Dimensionality reduction techniques are mathematical procedures that allow for this reduction; they have largely been developed within the statistics and machine learning disciplines. Another challenge in medical datasets is an imbalanced number of samples across classes, which leads to biased results in machine learning models. This study tackles these challenges with a neural network that incorporates an autoencoder to extract a latent space of the features and a Generative Adversarial Network (GAN) to generate synthetic samples. The latent space is the reduced-dimensional space that captures the meaningful features of the original data. Our model starts with feature selection to choose the discriminative features before feeding them to the neural network. The model then predicts the cancer outcome for different datasets. The proposed model outperformed existing models, achieving an accuracy of 95.09% on the bladder cancer dataset and 88.82% on the breast cancer dataset.
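As a rough illustration of the first stage of the pipeline described above (selecting discriminative features and checking class imbalance), the following minimal sketch ranks features by a Fisher score, that is, between-class separation divided by within-class variance. The abstract does not say which selection criterion the study uses, so the Fisher score, the function names `fisher_scores`, `select_top_k`, and `imbalance_ratio`, and the toy data are all assumptions made for illustration. The autoencoder and GAN stages are not implemented here.

```python
from collections import Counter
from statistics import mean, pvariance


def fisher_scores(X, y):
    """Return one Fisher score per feature column of X (a list of rows).

    Hypothetical helper: the source does not name its selection criterion.
    """
    classes = sorted(set(y))
    scores = []
    for j in range(len(X[0])):
        column = [row[j] for row in X]
        overall_mean = mean(column)
        between = 0.0  # weighted spread of class means around the overall mean
        within = 0.0   # weighted variance inside each class
        for c in classes:
            values = [row[j] for row, label in zip(X, y) if label == c]
            between += len(values) * (mean(values) - overall_mean) ** 2
            within += len(values) * pvariance(values)
        # A feature with zero within-class variance gets score 0 in this sketch.
        scores.append(between / within if within > 0 else 0.0)
    return scores


def select_top_k(X, y, k):
    """Keep the k highest-scoring feature columns; return (indices, reduced X)."""
    scores = fisher_scores(X, y)
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    keep = sorted(ranked[:k])
    return keep, [[row[j] for j in keep] for row in X]


def imbalance_ratio(y):
    """Size of the largest class divided by the size of the smallest class."""
    counts = Counter(y)
    return max(counts.values()) / min(counts.values())


# Toy data (made up): feature 0 separates the classes well, feature 1 is noise,
# feature 2 separates them weakly. Class 1 is the minority class.
X = [
    [1.0, 5.0, 0.2],
    [1.2, 3.0, 0.1],
    [0.9, 4.0, 0.3],
    [1.1, 6.0, 0.2],
    [3.0, 4.5, 0.9],
    [3.2, 4.5, 0.7],
]
y = [0, 0, 0, 0, 1, 1]

keep, X_reduced = select_top_k(X, y, k=2)
ratio = imbalance_ratio(y)
print("kept feature indices:", keep)
print("imbalance ratio:", ratio)
```

In a pipeline like the one the abstract outlines, the reduced matrix would then be passed to the autoencoder, and an imbalance ratio well above 1 would indicate that synthetic minority-class samples may be needed.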