Penalized logistic regression is extremely useful for binary classification with a large number of covariates (exceeding the sample size), and it has several real-life applications, including genomic disease classification. However, existing methods based on the likelihood loss function are sensitive to data contamination and other noise; hence, robust methods are needed for stable and more accurate inference. In this paper, we propose a family of robust estimators for sparse logistic models that combines the popular density power divergence based loss function with general adaptively weighted LASSO penalties. We study the local robustness of the proposed estimators through their influence function and also derive their oracle properties and asymptotic distribution. With extensive empirical illustrations, we demonstrate the significantly improved performance of our proposed estimators over existing ones, with particular gains in robustness. Finally, we apply our proposal to four real datasets for cancer classification, obtaining robust and accurate models that simultaneously perform gene selection and patient classification.
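To make the objective described above concrete, the following is a minimal sketch of a density power divergence (DPD) loss for the Bernoulli/logistic model combined with a weighted LASSO penalty. It uses the standard DPD form for a discrete model with tuning parameter alpha > 0; the exact parametrization, scaling, and choice of adaptive weights in the paper may differ. The names `sigmoid`, `dpd_logistic_objective`, `lam`, and `weights` are hypothetical and chosen only for illustration.

```python
import math


def sigmoid(t):
    # Numerically stable logistic function.
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def dpd_logistic_objective(beta, X, y, alpha, lam, weights):
    """Average DPD loss for logistic regression plus a weighted L1 penalty.

    Assumed form (illustrative, not necessarily the paper's exact scaling):
      (1/n) * sum_i [ p_i^(1+alpha) + (1-p_i)^(1+alpha)
                      - (1 + 1/alpha) * f_i^alpha ]
      + lam * sum_j weights[j] * |beta[j]|
    where p_i = sigmoid(x_i . beta) and f_i is the model probability
    of the observed label y_i (0 or 1).
    """
    if alpha <= 0:
        raise ValueError("alpha must be positive for the DPD loss")
    n = len(y)
    loss = 0.0
    for xi, yi in zip(X, y):
        p = sigmoid(sum(b * x for b, x in zip(beta, xi)))
        q = 1.0 - p
        # Sum over both outcomes of the model density raised to 1 + alpha.
        model_term = p ** (1 + alpha) + q ** (1 + alpha)
        # Model density at the observed label, raised to alpha.
        data_term = yi * p ** alpha + (1 - yi) * q ** alpha
        loss += model_term - (1 + 1 / alpha) * data_term
    # Weighted L1 penalty; a zero weight leaves a coefficient unpenalized.
    penalty = lam * sum(w * abs(b) for w, b in zip(weights, beta))
    return loss / n + penalty


# Toy usage with made-up numbers (first column is an intercept).
X = [[1.0, 0.5], [1.0, -1.2], [1.0, 2.0]]
y = [1, 0, 1]
value = dpd_logistic_objective([0.2, -0.4], X, y, alpha=0.5, lam=0.1,
                               weights=[0.0, 1.0])
```

In this form each observation contributes a bounded amount to the loss for any fixed alpha > 0, whereas the negative log-likelihood grows without bound for a badly misclassified point; this is one way to see why such a loss can be less sensitive to contaminated observations. As alpha tends to 0, the loss (after adding the constant 1/alpha) approaches the usual negative log-likelihood.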