The computational resources required for modeling grow with the size of the dataset, since traditional regression approaches involve inverting very large data matrices. Because the main difficulty lies in the sheer size of the data, a standard remedy is subsampling, which aims to obtain the most informative portion of the big data. In this paper, we explore an existing approach based on leverage scores that was originally proposed for subdata selection in linear model discrimination. Our objective is to apply this approach to select the most informative data points for estimating the unknown parameters in both the first-order linear model and a model with interactions. Through simulation experiments and a real data application, we conclude that the approach based on leverage scores improves on existing approaches.
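To make the idea of leverage-based subdata selection concrete, the following is a minimal sketch, not the procedure used in the paper. It assumes that the leverage scores are the diagonal entries of the hat matrix H = X (X^T X)^{-1} X^T and that the k rows with the largest scores are kept; the paper's exact selection rule may differ. The function names, the simulated data, and the values of n, p and k are all hypothetical.

```python
import numpy as np


def leverage_scores(X):
    """Diagonal of the hat matrix H = X (X^T X)^{-1} X^T.

    Computed from a thin QR factorization, so the n x n matrix H
    is never formed explicitly.
    """
    Q, _ = np.linalg.qr(X)  # reduced mode: Q has the same shape as X
    return np.sum(Q**2, axis=1)


def select_subdata(X, k):
    """Indices of the k rows with the largest leverage scores (assumed rule)."""
    h = leverage_scores(X)
    return np.argsort(h)[-k:]


def fit_ols(X, y):
    """Ordinary least squares estimate of the coefficient vector."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta


# Hypothetical simulated data for a first-order linear model.
rng = np.random.default_rng(0)
n, p, k = 10_000, 3, 200
Z = rng.normal(size=(n, p))
X = np.column_stack([np.ones(n), Z])  # intercept plus p covariates
beta_true = np.array([1.0, 2.0, -1.0, 0.5])
y = X @ beta_true + rng.normal(size=n)

h = leverage_scores(X)
idx = select_subdata(X, k)
beta_hat = fit_ols(X[idx], y[idx])  # estimate from the subdata only
```

For a model with interactions, the same functions could be applied after appending the pairwise product columns of the covariates to X, so that the leverage scores reflect the expanded model matrix.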