In statistics and machine learning, logistic regression is a widely used supervised learning technique, employed primarily for binary classification. For the setting in which the number of observations greatly exceeds the number of predictor variables, we present a simple, randomized sampling-based algorithm for the logistic regression problem that guarantees high-quality approximations to both the estimated probabilities and the overall discrepancy of the model. Our analysis builds upon two simple structural conditions that reduce to randomized matrix multiplication, a fundamental and well-understood primitive of randomized numerical linear algebra. We analyze the properties of the estimated probabilities of logistic regression when leverage scores are used to sample observations, and prove that accurate approximations can be achieved with a sample whose size is much smaller than the total number of observations. To further validate our theoretical findings, we conduct comprehensive empirical evaluations. Overall, our work sheds light on the potential of randomized sampling to efficiently approximate the estimated probabilities in logistic regression, offering a practical and computationally efficient solution for large-scale datasets.
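The idea of sampling observations by leverage score can be illustrated with a minimal sketch. This is not the authors' exact algorithm: the abstract does not specify how sampled rows are rescaled or which solver is used, so the importance weights `1 / (s * p_i)`, the Newton solver, the synthetic data, and all function names below (`leverage_scores`, `sample_rows`, `fit_logistic`) are assumptions made for illustration. It computes the leverage scores of the data matrix from a thin QR factorization, draws `s` rows with probability proportional to those scores, fits a weighted logistic regression on the sample, and compares the resulting estimated probabilities with those of the full-data fit.

```python
import numpy as np


def sigmoid(z):
    # Clip the argument to avoid overflow in exp for extreme values.
    return 1.0 / (1.0 + np.exp(-np.clip(z, -30.0, 30.0)))


def leverage_scores(X):
    # With a thin QR factorization X = QR, the leverage score of row i
    # is the squared Euclidean norm of row i of Q.
    Q, _ = np.linalg.qr(X, mode="reduced")
    return np.sum(Q ** 2, axis=1)


def sample_rows(X, y, s, rng):
    # Draw s rows with replacement, with probability proportional to leverage.
    lev = leverage_scores(X)
    p = lev / lev.sum()
    idx = rng.choice(X.shape[0], size=s, replace=True, p=p)
    # Importance weights (an assumption of this sketch): they make the
    # weighted log-likelihood an unbiased estimate of the full one.
    w = 1.0 / (s * p[idx])
    return X[idx], y[idx], w


def fit_logistic(X, y, w=None, iters=50, ridge=1e-8):
    # Newton's method on the weighted negative log-likelihood.
    n, d = X.shape
    w = np.ones(n) if w is None else w
    beta = np.zeros(d)
    for _ in range(iters):
        prob = sigmoid(X @ beta)
        grad = X.T @ (w * (prob - y))
        hess = X.T @ (X * (w * prob * (1.0 - prob))[:, None]) + ridge * np.eye(d)
        step = np.linalg.solve(hess, grad)
        beta -= step
        if np.linalg.norm(step) < 1e-10:
            break
    return beta


# Synthetic data with many more observations than predictors (n >> d).
rng = np.random.default_rng(0)
n, d, s = 20000, 5, 2000
X = rng.standard_normal((n, d))
beta_true = rng.standard_normal(d)
y = (rng.random(n) < sigmoid(X @ beta_true)).astype(float)

beta_full = fit_logistic(X, y)            # fit on all n observations
Xs, ys, ws = sample_rows(X, y, s, rng)    # leverage-score sample of size s
beta_sub = fit_logistic(Xs, ys, ws)       # fit on the sample only

# Compare estimated probabilities on the full data set.
mean_abs_diff = np.mean(np.abs(sigmoid(X @ beta_full) - sigmoid(X @ beta_sub)))
print("mean |p_full - p_sub| =", mean_abs_diff)
```

On Gaussian data such as this the leverage scores are close to uniform, so the sketch mostly shows the mechanics; the scores matter more when a few observations dominate the column space of the data matrix.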