Biomarker discovery from omics data with machine learning is challenging because molecular features are abundant while samples are scarce. Most feature selection methods require evaluating many candidate feature sets (models) to determine the most effective combination. This evaluation is typically conducted on a validation dataset, and the feature set is chosen to optimize the model's estimated performance. Every evaluation carries performance estimation error, and when the selection involves many models, the performance of the best ones is almost certainly overestimated. Biomarker identification with feature selection can be framed as a multi-objective problem, with trade-offs between predictive ability and parsimony in the number of features. Genetic algorithms are a popular tool for multi-objective optimization, but because they evolve numerous solutions they are prone to overestimation. Methods have been proposed to reduce the overestimation after a model has already been selected in single-objective problems, but no existing algorithm could reduce the overestimation during the optimization, improve model selection, or be applied in the more general multi-objective domain. We propose DOSA-MO, a novel multi-objective optimization wrapper algorithm that learns how the original estimation, its variance, and the feature set size of the solutions predict the overestimation. DOSA-MO adjusts the expected performance during the optimization, improving the composition of the solution set. Using three transcriptomics datasets for kidney and breast cancer, we verify that DOSA-MO improves the performance of a state-of-the-art genetic algorithm on left-out or external sample sets when predicting cancer subtypes and/or patient overall survival.
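The overestimation that arises when the best of many evaluated models is selected can be illustrated with a small simulation. The sketch below is not DOSA-MO and not part of the reported experiments. It assumes, purely for illustration, that every candidate model has the same true performance and that validation estimates differ only by Gaussian noise; the function name and all parameter values are hypothetical.

```python
import random
import statistics


def simulate_selection_bias(n_models=200, true_score=0.70, noise_sd=0.05,
                            n_repeats=500, seed=0):
    """Return (mean estimate of the selected model, its true score).

    Illustrative assumption: all candidates share the same true score,
    so any gap between the two returned values is pure selection bias.
    """
    rng = random.Random(seed)
    best_estimates = []
    for _ in range(n_repeats):
        # Noisy validation estimates for each candidate feature set.
        estimates = [true_score + rng.gauss(0.0, noise_sd)
                     for _ in range(n_models)]
        # Model selection keeps the candidate with the highest estimate.
        best_estimates.append(max(estimates))
    return statistics.mean(best_estimates), true_score


mean_best, truth = simulate_selection_bias()
overestimation = mean_best - truth
print(f"mean estimate of selected model: {mean_best:.3f}")
print(f"true performance:                {truth:.3f}")
print(f"overestimation:                  {overestimation:.3f}")
```

Under these assumptions the gap grows with the number of candidates and with the noise level, which is consistent with the concern that optimizers evolving many solutions are especially exposed to overestimation. With a single candidate (`n_models=1`) the gap should be close to zero.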