Data reduction is a fundamental challenge of modern technology, where classical statistical methods are not applicable because of computational limitations. We consider linear regression with an extraordinarily large number of observations but only a few covariates. Subsampling aims to select a given percentage of the original data. Under distributional assumptions on the covariates, we derive D-optimal subsampling designs and study their theoretical properties. We make use of fundamental concepts of optimal design theory and an equivalence theorem from constrained convex optimization. The resulting subsampling designs provide simple rules for accepting or rejecting a data point, allowing for easy algorithmic implementation. In addition, we propose a simplified subsampling method with lower computational complexity that differs from the D-optimal design. We present a simulation study comparing both subsampling schemes with the IBOSS method for a fixed subsample size.
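To make the idea of an accept/reject rule concrete, here is a minimal Python sketch. The abstract does not state the rule itself, so everything specific here is an assumption made for illustration: the covariates are taken to be independent standard normal, and a point is accepted when its squared Euclidean norm exceeds the empirical (1 - alpha) quantile. The helper names `subsample_by_norm` and `log_det_information` are hypothetical. This is not presented as the paper's derived design or its simplified method; it only shows how a threshold rule can be compared with uniform random subsampling through the log-determinant of the information matrix, which is the quantity the D-criterion maximizes.

```python
import numpy as np


def subsample_by_norm(X, alpha):
    """Boolean mask accepting roughly the fraction alpha of rows with largest squared norm.

    Assumed illustrative rule: accept x if ||x||^2 >= empirical (1 - alpha) quantile.
    """
    sq_norm = np.sum(X**2, axis=1)
    threshold = np.quantile(sq_norm, 1.0 - alpha)
    return sq_norm >= threshold


def log_det_information(X):
    """log det(F^T F) for linear regression with intercept, where F = [1, X]."""
    F = np.column_stack([np.ones(len(X)), X])
    _, logdet = np.linalg.slogdet(F.T @ F)
    return logdet


rng = np.random.default_rng(0)
n, d, alpha = 100_000, 3, 0.01           # many observations, few covariates
X = rng.standard_normal((n, d))          # assumed covariate distribution

keep = subsample_by_norm(X, alpha)       # accept/reject decision per data point
k = int(keep.sum())                      # realized subsample size

# Baseline: uniform random subsample of the same size.
random_idx = rng.choice(n, size=k, replace=False)

logdet_rule = log_det_information(X[keep])
logdet_random = log_det_information(X[random_idx])
print(f"subsample size: {k}")
print(f"log det (threshold rule): {logdet_rule:.3f}")
print(f"log det (uniform random): {logdet_random:.3f}")
```

Under these assumptions the threshold rule yields the larger log-determinant, because it keeps points far from the center of the covariate distribution. Each decision depends only on the point's own norm and one precomputed threshold, which is what makes a rule of this kind easy to apply to very large data sets.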