In statistics and machine learning, logistic regression is a widely used supervised learning technique, primarily employed for binary classification tasks. For the setting in which the number of observations greatly exceeds the number of predictor variables, we present a simple, randomized sampling-based algorithm for the logistic regression problem that guarantees high-quality approximations to both the estimated probabilities and the overall discrepancy of the model. Our analysis builds upon two simple structural conditions that boil down to randomized matrix multiplication, a fundamental and well-understood primitive of randomized numerical linear algebra. We analyze the properties of the estimated probabilities of logistic regression when leverage scores are used to sample observations, and prove that accurate approximations can be achieved with a sample whose size is much smaller than the total number of observations. To further validate our theoretical findings, we conduct comprehensive empirical evaluations. Overall, our work sheds light on the potential of randomized sampling approaches to efficiently approximate the estimated probabilities in logistic regression, offering a practical and computationally efficient solution for large-scale datasets.
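To make the idea of leverage-score sampling concrete, the following is a minimal sketch in Python with NumPy. It is not the algorithm analyzed in this work. It assumes a common recipe: sample rows with probability proportional to their leverage scores, reweight each sampled row by the inverse of its expected sampling count, and fit a weighted logistic regression on the sample. The function names (`leverage_scores`, `sample_rows`, `fit_logistic`), the Newton solver, the small ridge term, and the synthetic data are all illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    # Clip to avoid overflow in exp for extreme linear predictors.
    return 1.0 / (1.0 + np.exp(-np.clip(z, -30.0, 30.0)))

def leverage_scores(X):
    # Thin QR: the columns of Q are an orthonormal basis for the column space of X.
    # The i-th leverage score is the squared Euclidean norm of the i-th row of Q.
    Q, _ = np.linalg.qr(X, mode="reduced")
    return np.sum(Q ** 2, axis=1)

def sample_rows(X, y, s, rng):
    # Sample s rows with replacement, with probabilities proportional to leverage.
    lev = leverage_scores(X)
    p = lev / lev.sum()
    idx = rng.choice(X.shape[0], size=s, replace=True, p=p)
    # Importance weights 1 / (s * p_i) keep the weighted log-likelihood unbiased.
    weights = 1.0 / (s * p[idx])
    return X[idx], y[idx], weights

def fit_logistic(X, y, weights=None, n_iter=50, ridge=1e-8, tol=1e-10):
    # Weighted logistic regression via Newton's method (an assumed solver choice).
    n, d = X.shape
    w = np.ones(n) if weights is None else weights
    beta = np.zeros(d)
    for _ in range(n_iter):
        mu = sigmoid(X @ beta)
        grad = X.T @ (w * (mu - y))
        hess = X.T @ (X * (w * mu * (1.0 - mu))[:, None]) + ridge * np.eye(d)
        step = np.linalg.solve(hess, grad)
        beta -= step
        if np.linalg.norm(step) < tol:
            break
    return beta

# Synthetic example with many more observations than predictors (illustrative only).
rng = np.random.default_rng(0)
n, d, s = 20000, 5, 5000
X = rng.standard_normal((n, d))
beta_true = np.array([1.0, -1.0, 0.5, 0.0, 0.25])
y = (rng.random(n) < sigmoid(X @ beta_true)).astype(float)

beta_full = fit_logistic(X, y)                 # fit on all n observations
Xs, ys, ws = sample_rows(X, y, s, rng)         # leverage-score subsample
beta_sub = fit_logistic(Xs, ys, ws)            # weighted fit on s observations

# Compare estimated probabilities from the two fits on the full data set.
gap = np.mean(np.abs(sigmoid(X @ beta_full) - sigmoid(X @ beta_sub)))
print(f"mean absolute gap in estimated probabilities: {gap:.4f}")
```

In this sketch the leverage scores are computed exactly, which itself requires a pass over all observations; how the scores are obtained or approximated in practice is outside what the sketch covers.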