The IBOSS approach proposed by Wang et al. (2019) selects the most informative subset of the n data points. It assumes that ordinary least squares is used for fitting and requires that the number of variables, p, is not large. However, in many practical problems p is very large, and penalty-based model fitting methods such as LASSO are used. We study big data problems in which both n and p are large. In the first part, we focus on reducing the number of data points. We develop theoretical results showing that IBOSS-type approaches are applicable to penalty-based regressions such as LASSO. In the second part, we consider situations where p is extremely large. We propose a two-step approach that first reduces the number of variables and then reduces the number of data points. Two separate algorithms are developed, and their performance is studied through extensive simulation studies. Compared to existing methods, including the well-known split-and-conquer approach, the proposed methods enjoy advantages in terms of estimation accuracy, prediction accuracy, and computation time.
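The two-step idea (reduce variables first, then data points) can be illustrated with a minimal sketch. It is not the authors' algorithm: the screening rule (ranking variables by absolute marginal correlation with the response) and the names `screen_variables`, `iboss_select`, `d`, and `k` are assumptions made for illustration only. The row-selection step follows the IBOSS idea of Wang et al. (2019), taking for each retained variable the rows with the most extreme values among those not yet selected.

```python
import numpy as np

def screen_variables(X, y, d):
    """Keep the d columns with the largest absolute marginal correlation with y.

    Assumed screening rule for illustration; the abstract does not specify one.
    """
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    denom = np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc)
    # Constant columns get correlation 0 instead of a division by zero.
    corr = np.abs(Xc.T @ yc) / np.where(denom > 0, denom, np.inf)
    return np.sort(np.argsort(corr)[-d:])

def iboss_select(X, k):
    """IBOSS-style row selection.

    For each column in turn, pick the r smallest and r largest values among
    rows not yet selected, with r = k // (2 * p).
    """
    n, p = X.shape
    r = k // (2 * p)
    if r < 1 or k > n:
        raise ValueError("need 2 * p <= k <= n")
    available = np.ones(n, dtype=bool)
    chosen = []
    for j in range(p):
        idx = np.flatnonzero(available)
        order = idx[np.argsort(X[idx, j])]
        picked = np.concatenate([order[:r], order[-r:]])
        available[picked] = False
        chosen.append(picked)
    return np.sort(np.concatenate(chosen))

# Simulated data (hypothetical sizes): only the first 5 of 200 variables matter.
rng = np.random.default_rng(0)
n, p, d, k = 10_000, 200, 10, 400
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[:5] = 2.0
y = X @ beta + rng.standard_normal(n)

cols = screen_variables(X, y, d)      # step 1: reduce the number of variables
rows = iboss_select(X[:, cols], k)    # step 2: reduce the number of data points
X_sub, y_sub = X[np.ix_(rows, cols)], y[rows]
```

A penalized fit such as LASSO would then be run on `X_sub` and `y_sub` rather than on the full data. A full sort is used here for brevity; a partial selection would avoid sorting each column completely.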