Data reduction is a fundamental challenge in modern technology, where computational limitations make classical statistical methods inapplicable. We consider linear regression with an extraordinarily large number of observations but only a few covariates. Subsampling aims to select a given percentage of the original data. Under distributional assumptions on the covariates, we derive D-optimal subsampling designs and study their theoretical properties. To do so, we use fundamental concepts of optimal design theory and an equivalence theorem from constrained convex optimization. The resulting subsampling designs provide simple rules for accepting or rejecting a data point, which allows for an easy algorithmic implementation. In addition, we propose a simplified subsampling method that differs from the D-optimal design but requires less computing time. We present a simulation study comparing both subsampling schemes with the IBOSS method.
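To make the idea of an accept/reject rule concrete, the following is a minimal sketch, not the designs derived in the paper. It assumes a single covariate with a symmetric distribution and uses a hypothetical threshold rule: a point is accepted when its distance from the median exceeds an empirical quantile, so that roughly a fraction `alpha` of the data is kept. The names `tail_subsample`, `design_det`, and `ols_fit`, and all simulation settings, are illustrative assumptions.

```python
import numpy as np


def tail_subsample(x, y, alpha):
    """Accept a point when |x - median| is at or above the (1 - alpha) quantile.

    Hypothetical rule for one symmetric covariate; keeps about alpha * n points.
    """
    centered = np.abs(x - np.median(x))
    threshold = np.quantile(centered, 1.0 - alpha)
    keep = centered >= threshold
    return x[keep], y[keep]


def design_det(x):
    """Determinant of X^T X for the model y = b0 + b1 * x (D-criterion)."""
    X = np.column_stack([np.ones_like(x), x])
    return np.linalg.det(X.T @ X)


def ols_fit(x, y):
    """Ordinary least squares estimate of (intercept, slope)."""
    X = np.column_stack([np.ones_like(x), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta


# Simulated data (assumed settings, for illustration only).
rng = np.random.default_rng(0)
n, alpha = 100_000, 0.01
x = rng.standard_normal(n)
y = 1.0 + 2.0 * x + rng.standard_normal(n)

# Threshold-based subsample versus a uniform random subsample of the same size.
xs, ys = tail_subsample(x, y, alpha)
idx = rng.choice(n, size=len(xs), replace=False)

beta_hat = ols_fit(xs, ys)
det_tail = design_det(xs)
det_random = design_det(x[idx])
```

In this sketch, comparing `det_tail` with `det_random` shows how selecting points in the tails of the covariate distribution affects the determinant of the information matrix, which is the quantity the D-criterion maximizes.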