Massive datasets are now generated in fields such as computer vision, medical imaging, and astronomy, and their large scale and high dimensionality hamper the use of classical statistical models. One efficient way to address the computational challenge is subsampling, which draws subsamples from the original large dataset according to a carefully designed, task-specific probability distribution to form an informative sketch. The computational cost is reduced by applying the original algorithm to the substantially smaller sketch. Previous studies of subsampling focused on the computational efficiency and theoretical guarantees of non-regularized regression, such as ordinary least squares regression and logistic regression. In this article, we introduce a randomized algorithm under the subsampling scheme for Elastic-net regression, which gives novel insights into L1-norm regularized regression problems. To carry out the consistency analysis effectively, a smooth approximation technique based on the alpha absolute function is first employed and theoretically verified. Concentration bounds and asymptotic normality for the proposed randomized algorithm are then established under mild conditions. Moreover, an optimal subsampling probability is constructed according to A-optimality. The effectiveness of the proposed algorithm is demonstrated on synthetic and real datasets.
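The subsampling scheme described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names are hypothetical, the Elastic-net parameterization (`lam` for overall strength, `alpha` for the L1/L2 mix) is an assumption, and a plain coordinate-descent solver is swapped in for the smooth-approximation analysis the abstract mentions. The abstract does not give the A-optimal probabilities, so uniform probabilities are used as a placeholder; any other distribution can be passed through `probs`. Sampled rows are reweighted by the inverse of their sampling probability so that the subsampled loss is an unbiased estimate of the full-data loss.

```python
import numpy as np


def soft_threshold(z, t):
    """Soft-thresholding operator used for the L1 part of the penalty."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)


def weighted_elastic_net(X, y, w, lam, alpha, n_iter=200, tol=1e-8):
    """Coordinate descent for the (assumed) weighted Elastic-net objective
        0.5 * sum_i w_i * (y_i - x_i @ beta)**2
        + lam * (alpha * ||beta||_1 + 0.5 * (1 - alpha) * ||beta||_2**2).
    """
    d = X.shape[1]
    beta = np.zeros(d)
    resid = y.astype(float).copy()            # y - X @ beta with beta = 0
    col_sq = (w[:, None] * X ** 2).sum(axis=0)
    for _ in range(n_iter):
        max_change = 0.0
        for k in range(d):
            denom = col_sq[k] + lam * (1.0 - alpha)
            if denom == 0.0:                  # empty column and pure L1 penalty
                continue
            old = beta[k]
            # Weighted correlation of column k with the partial residual.
            rho = np.dot(w * X[:, k], resid) + col_sq[k] * old
            new = soft_threshold(rho, lam * alpha) / denom
            if new != old:
                resid -= X[:, k] * (new - old)
                beta[k] = new
                max_change = max(max_change, abs(new - old))
        if max_change < tol:
            break
    return beta


def subsample_elastic_net(X, y, r, lam, alpha, probs=None, seed=0):
    """Draw r rows with replacement and fit on the reweighted sketch.

    probs defaults to uniform sampling; this is a placeholder assumption,
    not the A-optimal probabilities constructed in the article.
    """
    n = X.shape[0]
    rng = np.random.default_rng(seed)
    probs = np.full(n, 1.0 / n) if probs is None else np.asarray(probs, float)
    idx = rng.choice(n, size=r, replace=True, p=probs)
    # Inverse-probability weights: the sketch loss estimates (1/n) * full loss.
    w = 1.0 / (n * r * probs[idx])
    return weighted_elastic_net(X[idx], y[idx], w, lam, alpha)


# Small synthetic demonstration (all values here are illustrative choices).
rng = np.random.default_rng(1)
n, d = 20000, 10
beta_true = np.zeros(d)
beta_true[:3] = [2.0, -1.5, 1.0]
X = rng.standard_normal((n, d))
y = X @ beta_true + rng.standard_normal(n)

beta_sub = subsample_elastic_net(X, y, r=2000, lam=0.01, alpha=0.5)
beta_full = weighted_elastic_net(X, y, np.full(n, 1.0 / n), lam=0.01, alpha=0.5)
print("subsample estimate:", np.round(beta_sub, 3))
print("full-data estimate:", np.round(beta_full, 3))
```

With uniform probabilities the weights reduce to `1 / r`, so the sketch fit is an ordinary Elastic-net fit on `r` rows. A non-uniform `probs` changes only which rows are drawn and how they are weighted, which is where a task-specific distribution would plug in.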