In big data analytics, efficient subdata selection methods that enable robust statistical inference at low computational cost are highly important. Because typically only a subset of a large number of variables is active, variable selection can be performed before subdata selection. We propose an approach for settings in which both the size of the full dataset and the number of variables are large. The approach first identifies the active variables by applying a procedure inspired by random LASSO (Least Absolute Shrinkage and Selection Operator) and then selects subdata based on leverage scores to build a predictive model. The proposed approach outperforms existing approaches in the literature, including use of the full dataset, in both variable selection and prediction, while also substantially reducing computing time. Simulation experiments and a real data application are provided.
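The two-stage idea described in the abstract (screen variables first, then select subdata by leverage) can be illustrated with a minimal sketch. This is not the authors' algorithm: the abstract does not specify the details, so everything below is an assumption made for illustration. In particular, the function names (`lasso_cd`, `screen_variables`, `leverage_scores`, `select_and_fit`), the penalty `lam`, the importance `threshold`, the number of bootstrap fits `n_boot`, and the rule of keeping the `k` rows with the largest leverage are all hypothetical choices. The screening step only borrows the bootstrap-plus-random-variable-subset idea from random LASSO.

```python
import numpy as np


def lasso_cd(X, y, lam, n_iter=50):
    """Coordinate-descent LASSO for (1/2n)||y - X b||^2 + lam * ||b||_1."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n
    resid = y.copy()  # equals y - X @ beta while beta is all zeros
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            # Covariance of column j with the partial residual (column j removed).
            rho = X[:, j] @ resid / n + col_sq[j] * beta[j]
            new = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_sq[j]
            resid += X[:, j] * (beta[j] - new)  # keep resid = y - X @ beta
            beta[j] = new
    return beta


def screen_variables(X, y, n_boot=30, n_vars=None, lam=0.2, threshold=0.2, seed=0):
    """Assumed screening rule: average LASSO coefficients over bootstrap fits,
    each using a random subset of candidate variables, then threshold."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    q = n_vars or max(1, p // 2)
    coef_sum = np.zeros(p)
    for _ in range(n_boot):
        rows = rng.integers(0, n, size=n)             # bootstrap rows
        cols = rng.choice(p, size=q, replace=False)   # random candidate variables
        Xb = X[np.ix_(rows, cols)]
        yb = y[rows]
        coef_sum[cols] += lasso_cd(Xb - Xb.mean(axis=0), yb - yb.mean(), lam)
    # Variables not drawn in a given fit contribute zero to the average.
    importance = np.abs(coef_sum / n_boot)
    return np.flatnonzero(importance > threshold)


def leverage_scores(X):
    """Diagonal of the hat matrix, computed from a thin QR factorization."""
    Q, _ = np.linalg.qr(X)
    return (Q ** 2).sum(axis=1)


def select_and_fit(X, y, k, **screen_kwargs):
    """Screen variables, keep the k highest-leverage rows, fit OLS on them."""
    active = screen_variables(X, y, **screen_kwargs)
    Xa = np.column_stack([np.ones(len(y)), X[:, active]])  # intercept + active variables
    rows = np.argsort(leverage_scores(Xa))[-k:]            # assumed selection rule
    coef, *_ = np.linalg.lstsq(Xa[rows], y[rows], rcond=None)
    return active, rows, coef
```

Calling `select_and_fit(X, y, k=200)` on an `n × p` design matrix returns the indices of the screened variables, the indices of the selected rows, and the fitted coefficients (intercept first). Computing leverage only on the screened columns keeps the QR step cheap when `p` is large, which is the motivation for performing variable selection before subdata selection.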