Modern datasets in biology and chemistry often contain a large number of variables together with outlying samples that arise from measurement errors or rare biological and chemical profiles. To handle such datasets, we introduce a method to learn a robust ensemble composed of a small number of sparse, diverse and robust models, the first of its kind in the literature. The degree to which the models are sparse, diverse and resistant to data contamination is driven directly by the data through a cross-validation criterion. We establish the finite-sample breakdown point of the ensembles and of the models that comprise them, and we develop a tailored computing algorithm to learn the ensembles by leveraging recent developments in l0 optimization. Our extensive numerical experiments on synthetic data and on artificially contaminated real datasets from bioinformatics and cheminformatics demonstrate the competitive advantage of our method over state-of-the-art sparse and robust methods. We also demonstrate the applicability of our proposal on a cardiac allograft vasculopathy dataset.