Dual-stage optimizer for systematic overestimation adjustment applied to multi-objective genetic algorithms for biomarker selection

From arXiv: Added a figure showing the algorithm steps and a supplementary section that disambiguates the technical terms. Moved sections to the supplementary material to shorten the main text. Fixed typos.

The challenge in biomarker discovery from omics data using machine learning lies in the abundance of molecular features and the scarcity of samples. Most feature selection methods in machine learning require evaluating various sets of features (models) to determine the most effective combination. This process, typically conducted on a validation dataset, involves testing different feature sets to optimize the model's performance. Every evaluation carries a performance estimation error, and when the selection involves many models, the best ones are almost certainly overestimated. Biomarker identification with feature selection methods can be addressed as a multi-objective problem with trade-offs between predictive ability and parsimony in the number of features. Genetic algorithms are a popular tool for multi-objective optimization, but they evolve numerous solutions and are therefore prone to overestimation. Methods have been proposed to reduce the overestimation after a model has already been selected in single-objective problems, but no algorithm existed that could reduce the overestimation during the optimization, improve model selection, or be applied in the more general multi-objective domain. We propose DOSA-MO, a novel multi-objective optimization wrapper algorithm that learns how the original estimation, its variance, and the feature set size of the solutions predict the overestimation. DOSA-MO adjusts the expectation of the performance during the optimization, improving the composition of the solution set. We verify that DOSA-MO improves the performance of a state-of-the-art genetic algorithm on left-out or external sample sets, when predicting cancer subtypes and/or patient overall survival, using three transcriptomics datasets for kidney and breast cancer.
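The overestimation effect and the idea of learning an adjustment from the estimate, its spread, and the feature set size can be illustrated with a small simulation. The sketch below is not the DOSA-MO implementation: it is a toy under stated assumptions, with a synthetic "true" score that a real study would not know (it would have to be approximated on left-out samples), a plain least-squares regressor standing in for whatever model learns the overestimation, and hypothetical helper names (`simulate`, `fit_overestimation_model`, `adjusted_estimate`).

```python
import numpy as np


def simulate(rng, n_models=500, n_folds=5):
    """Toy candidates: feature count, true score, noisy per-fold validation scores."""
    n_features = rng.integers(1, 51, size=n_models)
    # Assumed toy truth: more features help slightly, plus model-specific variation.
    true = 0.6 + 0.1 * np.log(n_features) / np.log(50) + rng.normal(0.0, 0.03, n_models)
    # Assumed toy noise: larger feature sets are estimated less reliably.
    fold_sd = 0.08 * (1.0 + n_features / 50.0)
    folds = true[:, None] + rng.normal(0.0, 1.0, (n_models, n_folds)) * fold_sd[:, None]
    return n_features, true, folds


def design(estimate, spread, n_features):
    """Design matrix: intercept, original estimate, its spread, feature set size."""
    return np.column_stack([np.ones_like(estimate), estimate, spread, n_features])


def fit_overestimation_model(estimate, spread, n_features, overestimation):
    """Least-squares fit predicting overestimation from the three predictors."""
    coef, *_ = np.linalg.lstsq(
        design(estimate, spread, n_features), overestimation, rcond=None
    )
    return coef


def adjusted_estimate(coef, estimate, spread, n_features):
    """Subtract the predicted overestimation from the original estimate."""
    return estimate - design(estimate, spread, n_features) @ coef


rng = np.random.default_rng(0)

# First pass: observe how much each candidate was overestimated and learn from it.
nf_a, true_a, folds_a = simulate(rng)
est_a, sd_a = folds_a.mean(axis=1), folds_a.std(axis=1, ddof=1)
coef = fit_overestimation_model(est_a, sd_a, nf_a, est_a - true_a)

# Second pass: adjust the estimates of a fresh set of candidates.
nf_b, true_b, folds_b = simulate(rng)
est_b, sd_b = folds_b.mean(axis=1), folds_b.std(axis=1, ddof=1)
adj_b = adjusted_estimate(coef, est_b, sd_b, nf_b)

# Compare estimated and true scores for the 10 candidates that look best.
top = np.argsort(est_b)[-10:]
raw_gap = float(np.mean(est_b[top] - true_b[top]))
adj_gap = float(np.mean(adj_b[top] - true_b[top]))
print(f"mean overestimation of the top 10, raw estimates:      {raw_gap:.3f}")
print(f"mean overestimation of the top 10, adjusted estimates: {adj_gap:.3f}")
```

In this toy setting the candidates that look best on the raw estimates are, on average, scored above their true performance, and the adjusted estimates narrow that gap. Ranking candidates by the adjusted values instead of the raw ones is the step that would change which solutions an optimizer keeps.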
