Advances in data collection technologies in genomics have significantly increased the need for tools designed to study the genetic basis of many diseases. Effective statistical methods should excel in both prediction accuracy and biomarker identification. We introduce a novel approach to high-dimensional binary classification that integrates regularization with ensembling techniques. The method constructs compact ensembles of interpretable models obtained by optimizing a global objective function. In medical genomics applications, the proposed approach identifies critical biomarkers overlooked by competing methods. We develop a variable importance ranking system to help researchers prioritize promising genes. We establish the method's asymptotic properties and provide an efficient computational algorithm. Through extensive simulations across complex scenarios and analysis of cancer genomics datasets, we demonstrate strong predictive performance. Based on the numerical experiments, we offer practical guidelines for determining the optimal ensemble size.