Combining multiple predictors obtained from distributed data sources into an accurate meta-learner is a promising way to improve performance in many prediction problems. Because the accuracy of each predictor is usually unknown, integrating the predictors effectively is challenging. Conventional ensemble learning methods assess the accuracy of predictors using extensive labeled data, which is often difficult to acquire in practice. Furthermore, the predictors may be highly correlated, particularly when similar data sources or machine learning algorithms were used to train them. To address these challenges, this paper introduces a novel structured unsupervised ensemble learning model (SUEL) that exploits the dependency among a set of predictors with continuous predictive scores, ranks the predictors without labeled data, and combines them into a weighted ensemble score. Two novel correlation-based decomposition algorithms are further proposed to estimate the SUEL model: a constrained quadratic optimization approach (SUEL.CQO) and a matrix-factorization-based approach (SUEL.MF). The proposed methods are evaluated through both simulation studies and a real-world application to risk gene discovery. The results demonstrate that the proposed methods can efficiently integrate dependent predictors into an ensemble model without the need for ground truth data.
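To make the idea of ranking and weighting predictors from correlations alone more concrete, the following is a minimal sketch. It is not the SUEL.CQO or SUEL.MF algorithm, whose details the abstract does not give. It assumes a simplified setting in which predictor errors are independent given the true score, so the off-diagonal entries of the predictor correlation matrix are approximately rank one; SUEL is designed for the harder case where predictors are dependent. The function names (`rank_one_loadings`, `unsupervised_ensemble`, `simulate`), the simulated noise levels, and the choice of weights proportional to the estimated loadings are all illustrative assumptions.

```python
import numpy as np


def rank_one_loadings(corr, n_iter=200):
    """Fit corr[i, j] ~ r[i] * r[j] for i != j by iterated eigendecomposition.

    Assumption: one latent true score and conditionally independent errors.
    """
    c = corr.copy()
    for _ in range(n_iter):
        # eigh returns eigenvalues in ascending order; the last one is the largest.
        vals, vecs = np.linalg.eigh(c)
        r = vecs[:, -1] * np.sqrt(max(vals[-1], 0.0))
        # Replace the diagonal with the current rank-one estimate and refit.
        np.fill_diagonal(c, r ** 2)
    # Eigenvectors are sign-ambiguous; assume most predictors beat random guessing.
    return -r if r.sum() < 0 else r


def unsupervised_ensemble(scores):
    """scores: (n_samples, n_predictors) array of continuous scores, no labels."""
    # Standardize each predictor so that score scales are comparable.
    z = (scores - scores.mean(axis=0)) / scores.std(axis=0)
    r = rank_one_loadings(np.corrcoef(z, rowvar=False))
    weights = r / np.abs(r).sum()   # illustrative choice: weight by estimated loading
    ranking = np.argsort(-r)        # predictor indices, best estimated first
    return weights, ranking, z @ weights


def simulate(n=5000, seed=0):
    """Hypothetical data: four predictors = truth + independent Gaussian noise."""
    rng = np.random.default_rng(seed)
    truth = rng.normal(size=n)
    noise_sd = np.array([0.5, 1.0, 2.0, 4.0])
    scores = truth[:, None] + rng.normal(size=(n, 4)) * noise_sd
    return truth, scores


if __name__ == "__main__":
    truth, scores = simulate()
    weights, ranking, ensemble = unsupervised_ensemble(scores)
    print("estimated ranking:", ranking)
    print("weights:", np.round(weights, 3))
    # The truth is used only here, to check the result after the fact.
    print("corr(ensemble, truth):", round(np.corrcoef(ensemble, truth)[0, 1], 3))
```

In this sketch the labels never enter the estimation; the simulated truth is used only afterwards to check the result. When predictors share training data or algorithms, the rank-one assumption breaks down, which is the situation the structured model in the paper is meant to handle.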