In large-scale supervised learning, penalized logistic regression (PLR) effectively mitigates overfitting through regularization, yet its performance critically depends on robust variable selection. This paper demonstrates that label noise introduced during manual annotation, often dismissed as a mere artifact, can serve as a valuable source of information for enhancing variable selection in PLR. We show theoretically that such noise, intrinsically linked to classification difficulty, helps refine the estimation of the non-zero coefficients compared to using ground-truth labels alone, effectively turning a common imperfection into a useful information resource. To exploit this form of information fusion efficiently in large-scale settings where the data cannot be stored on a single machine, we propose a novel partition-insensitive parallel algorithm based on the alternating direction method of multipliers (ADMM). Our method ensures that the solution is invariant to how the data are distributed across workers, a key property for reproducible and stable distributed learning, while guaranteeing global convergence at a sublinear rate. Extensive experiments on multiple large-scale datasets show that the proposed approach consistently outperforms conventional variable selection techniques in both estimation accuracy and classification performance, affirming the value of deliberately fusing noisy manual labels into the learning process.
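To make the partition-insensitivity property concrete, a minimal sketch of consensus ADMM for L1-penalized logistic regression is given below. This is an illustrative toy implementation, not the paper's algorithm: the function names, the fixed step size, and the inner gradient-descent solver for the local subproblems are our assumptions. The key structural point it shows is that the sparse global variable `z` is updated from an average over workers, so at convergence the consensus solution does not depend on how the rows are split into shards.

```python
import numpy as np

def soft_threshold(v, kappa):
    """Proximal operator of the L1 norm (elementwise soft-thresholding)."""
    return np.sign(v) * np.maximum(np.abs(v) - kappa, 0.0)

def consensus_admm_plr(shards, lam=0.05, rho=1.0, n_iters=50,
                       inner_steps=50, lr=0.1):
    """Toy consensus ADMM for L1-penalized logistic regression.

    shards: list of (X_k, y_k) pairs, one per worker, with labels y_k in {0, 1}.
    Returns the sparse consensus coefficient vector z.
    """
    d = shards[0][0].shape[1]
    K = len(shards)
    x = np.zeros((K, d))   # local primal variables, one row per worker
    u = np.zeros((K, d))   # scaled dual variables
    z = np.zeros(d)        # global consensus variable
    for _ in range(n_iters):
        # Local x-updates: each worker approximately minimizes its logistic
        # loss plus the augmented-Lagrangian penalty (rho/2)||x - z + u||^2,
        # here by a few plain gradient-descent steps.
        for k, (Xk, yk) in enumerate(shards):
            w = x[k].copy()
            for _ in range(inner_steps):
                p = 1.0 / (1.0 + np.exp(-Xk @ w))
                grad = Xk.T @ (p - yk) / len(yk) + rho * (w - z + u[k])
                w -= lr * grad
            x[k] = w
        # Global z-update: soft-thresholding of the average of (x + u).
        # Only this averaged quantity enters, which is why the consensus
        # solution is insensitive to the data partition at convergence.
        z = soft_threshold((x + u).mean(axis=0), lam / (rho * K))
        # Dual ascent on the consensus constraint x_k = z.
        u += x - z
    return z
```

As a usage note, each `(X_k, y_k)` shard would live on a separate worker in a real distributed deployment, with the `z`-update performed by a single reduce-then-broadcast step; the serial loop over shards above stands in for that communication round.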