Regression becomes difficult for big datasets because of the sheer volume of data involved. The computational resources required for modeling grow with the scale of the dataset, since traditional regression approaches involve inverting huge data matrices. Because the main problem lies in the large data size, a standard remedy is subsampling, which aims to obtain the most informative portion of the big data. In this paper we consider an approach based on leverage scores that already exists in the literature. This approach was originally proposed for selecting subdata for linear model discrimination. We instead highlight its value for selecting the data points that are most informative for estimating the unknown parameters. Using simulation experiments and a real data application, we conclude that the leverage-score-based approach improves on existing approaches.
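To make the idea of leverage-based subsampling concrete, the following is a minimal sketch and not the exact procedure studied in the paper. It assumes a linear model with design matrix X, computes the leverage scores as the diagonal of the hat matrix X (X^T X)^{-1} X^T through a thin QR decomposition, and keeps the k rows with the largest scores. The deterministic top-k selection rule, the function names (`leverage_scores`, `select_top_leverage`, `ols`), and the simulated heavy-tailed data are all illustrative assumptions.

```python
import numpy as np


def leverage_scores(X):
    """Leverage scores: diagonal of H = X (X^T X)^{-1} X^T.

    With a thin QR decomposition X = QR, H = Q Q^T, so the i-th leverage
    score is the squared Euclidean norm of the i-th row of Q.
    """
    Q, _ = np.linalg.qr(X)  # default 'reduced' mode: Q has shape (n, p)
    return np.sum(Q ** 2, axis=1)


def select_top_leverage(X, k):
    """Indices of the k rows with the largest leverage scores.

    Assumption: deterministic top-k selection, chosen here for illustration.
    """
    h = leverage_scores(X)
    return np.argpartition(h, -k)[-k:]


def ols(X, y):
    """Ordinary least squares estimate of the coefficient vector."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta


# Simulated data (hypothetical): intercept plus heavy-tailed covariates.
rng = np.random.default_rng(0)
n, p, k = 10_000, 5, 500
Z = rng.standard_t(df=3, size=(n, p - 1))
X = np.column_stack([np.ones(n), Z])
beta_true = np.arange(1, p + 1, dtype=float)
y = X @ beta_true + rng.normal(size=n)

# Fit on the selected subdata and, for comparison, on the full data.
idx = select_top_leverage(X, k)
beta_sub = ols(X[idx], y[idx])
beta_full = ols(X, y)

print("subdata estimate:  ", np.round(beta_sub, 3))
print("full-data estimate:", np.round(beta_full, 3))
```

Note that this sketch computes exact leverage scores on the full matrix, which costs on the order of n p^2 operations; for very large n, one would presumably need an approximation or a cheaper selection rule, which is outside the scope of this illustration.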