In statistics and machine learning, logistic regression is a widely used supervised learning technique, employed primarily for binary classification. For the setting in which the number of observations greatly exceeds the number of predictor variables, we present a simple, randomized sampling-based algorithm for the logistic regression problem that guarantees high-quality approximations to both the estimated probabilities and the overall discrepancy of the model. Our analysis builds upon two simple structural conditions that boil down to randomized matrix multiplication, a fundamental and well-understood primitive of randomized numerical linear algebra. We analyze the properties of the estimated probabilities of logistic regression when leverage scores are used to sample observations, and we prove that accurate approximations can be achieved with a sample whose size is much smaller than the total number of observations. To further validate our theoretical findings, we conduct comprehensive empirical evaluations. Overall, our work shows that randomized sampling can efficiently approximate the estimated probabilities in logistic regression, offering a practical and computationally efficient solution for large-scale datasets.
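To make the sampling idea in the abstract concrete, the following is a minimal sketch of leverage-score row sampling followed by a weighted logistic fit on the sample. It is not the paper's algorithm as stated: the abstract does not specify the sampling probabilities, the rescaling, or the solver, so the choices here (probabilities proportional to leverage scores, importance weights `1 / (s * p_i)`, sampling with replacement, a plain Newton solver with a tiny ridge term) are assumptions. The function names and the synthetic data are hypothetical.

```python
import numpy as np


def sigmoid(z):
    # Clip to avoid overflow in exp for extreme logits.
    return 1.0 / (1.0 + np.exp(-np.clip(z, -30.0, 30.0)))


def leverage_scores(X):
    # Row leverage scores: squared row norms of an orthonormal basis
    # for the column space of X (here obtained from a thin QR).
    Q, _ = np.linalg.qr(X, mode="reduced")
    return np.sum(Q ** 2, axis=1)


def fit_logistic(X, y, weights=None, n_iter=50, ridge=1e-8):
    # Weighted logistic regression via Newton's method.
    # The small ridge term is an assumption added for numerical stability.
    n, d = X.shape
    w = np.ones(n) if weights is None else np.asarray(weights, dtype=float)
    beta = np.zeros(d)
    for _ in range(n_iter):
        p = sigmoid(X @ beta)
        grad = X.T @ (w * (p - y))
        hess = X.T @ (X * (w * p * (1.0 - p))[:, None]) + ridge * np.eye(d)
        step = np.linalg.solve(hess, grad)
        beta = beta - step
        if np.linalg.norm(step) < 1e-10:
            break
    return beta


def sampled_logistic(X, y, s, rng):
    # Sample s rows with probability proportional to leverage score
    # (with replacement), then reweight each sampled row by 1 / (s * p_i)
    # so the sampled loss is an unbiased estimate of the full loss.
    lev = leverage_scores(X)
    probs = lev / lev.sum()
    idx = rng.choice(X.shape[0], size=s, replace=True, p=probs)
    weights = 1.0 / (s * probs[idx])
    return fit_logistic(X[idx], y[idx], weights=weights)


# Synthetic tall-and-thin problem: many observations, few predictors.
rng = np.random.default_rng(0)
n, d, s = 20000, 5, 2000
X = rng.standard_normal((n, d))
beta_true = np.array([1.0, -0.5, 0.25, 0.0, 0.75])
y = (rng.random(n) < sigmoid(X @ beta_true)).astype(float)

beta_full = fit_logistic(X, y)
beta_samp = sampled_logistic(X, y, s, rng)

# Compare estimated probabilities on all n observations.
gap = np.mean(np.abs(sigmoid(X @ beta_full) - sigmoid(X @ beta_samp)))
print("mean absolute difference in estimated probabilities:", gap)
```

The comparison at the end evaluates both fits on all observations, which mirrors the abstract's focus on approximating the estimated probabilities rather than the coefficients themselves. Note that computing exact leverage scores with a QR factorization costs as much as a least-squares solve on the full data; for genuinely large inputs one would likely substitute approximate leverage scores, which this sketch does not attempt.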