Fast and Scalable Cellwise-Robust Ensembles for High-Dimensional Data

The analysis of high-dimensional data, ubiquitous in fields such as genomics, is frequently complicated by the presence of cellwise contamination, where individual cells rather than entire rows are corrupted. This contamination poses a significant challenge to standard variable selection techniques. While recent ensemble methods have introduced deterministic frameworks that partition the predictor space to manage high collinearity, these modern architectures were not designed to handle cellwise contamination, leaving a critical methodological gap. To bridge this gap, we propose the Fast and Scalable Cellwise-Robust Ensemble (FSCRE) algorithm, a novel, multi-stage framework integrating three key statistical stages. First, the algorithm establishes a robust foundation by deriving a cleaned data matrix and a reliable, cellwise-robust covariance structure. Variable selection then proceeds via a competitive ensemble: a robust, correlation-based formulation of the Least-Angle Regression (LARS) algorithm proposes candidates for multiple sub-models, and a cross-validation criterion arbitrates their final assignment. Despite its architectural complexity, the proposed method possesses fundamental theoretical properties, including invariance to data scaling and equivariance to predictor permutation, which establish its objectivity. Through extensive simulations and a bioinformatics application, we demonstrate FSCRE's superior performance in variable selection precision, recall, and predictive accuracy across various contamination scenarios. This work provides a unified framework connecting cellwise-robust estimation with high-performance ensemble learning, with an implementation available on CRAN.
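
The two invariance properties named above can be illustrated on the first step of a correlation-based LARS search, which picks the predictor with the largest absolute correlation with the response. The sketch below is not the FSCRE implementation: the function names are hypothetical, the data are synthetic, and plain Pearson correlation stands in for the cellwise-robust correlation estimate that the method would use.

```python
import numpy as np

def predictor_response_correlations(X, y):
    """Pearson correlation of each column of X with y.

    Assumption: this is a non-robust stand-in for a cellwise-robust estimate.
    """
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    return (Xc.T @ yc) / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc))

def first_lars_candidate(corr_xy):
    """Index of the predictor with the largest absolute correlation."""
    return int(np.argmax(np.abs(corr_xy)))

# Synthetic data: the response depends mainly on predictor 3.
rng = np.random.default_rng(0)
n, p = 50, 6
X = rng.normal(size=(n, p))
y = 2.0 * X[:, 3] + rng.normal(scale=0.5, size=n)

base = first_lars_candidate(predictor_response_correlations(X, y))

# Scale invariance: rescaling each predictor leaves the correlations,
# and therefore the selected predictor, unchanged.
scales = np.array([0.01, 10.0, 3.0, 1000.0, 0.5, 7.0])
scaled = first_lars_candidate(predictor_response_correlations(X * scales, y))

# Permutation equivariance: reordering the columns reorders the answer.
# Column j of X[:, perm] is original column perm[j].
perm = np.array([2, 0, 5, 1, 4, 3])
permuted = first_lars_candidate(predictor_response_correlations(X[:, perm], y))

print(base, scaled, int(perm[permuted]))
```

Because the selection depends on the data only through correlations, all three printed indices refer to the same original predictor. The same reasoning is what makes a correlation-based formulation a natural fit once a robust correlation matrix is available.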
