Data reduction is a fundamental challenge in modern technology, where classical statistical methods are not applicable because of computational limitations. We consider linear regression with an extraordinarily large number of observations but only a few covariates. Subsampling aims to select a given percentage of the original data. Under distributional assumptions on the covariates, we derive D-optimal subsampling designs and study their theoretical properties. To this end, we use fundamental concepts of optimal design theory and an equivalence theorem from constrained convex optimization. The resulting subsampling designs provide simple rules for accepting or rejecting a data point, which allows for an easy algorithmic implementation. In addition, we propose a simplified subsampling method that differs from the D-optimal design but requires less computing time. We present a simulation study comparing both subsampling schemes with the IBOSS method.
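To make the idea of an accept/reject rule concrete, the following is a minimal sketch and not the procedure derived in the paper. It assumes a single standard normal covariate and the straight-line model y = b0 + b1 * x, and it uses a hypothetical rule that keeps the points with the largest |x|. The function names `tail_subsample` and `information_det`, the sample size, and the subsampling proportion `alpha` are illustrative choices. The sketch compares the determinant of the information matrix for this rule with that of a uniform random subsample of the same size.

```python
import numpy as np

def tail_subsample(x, alpha):
    """Hypothetical accept/reject rule (an assumption for illustration):
    accept points whose |x| is at or above the empirical (1 - alpha)
    quantile of |x|."""
    threshold = np.quantile(np.abs(x), 1.0 - alpha)
    return np.abs(x) >= threshold

def information_det(x):
    """Determinant of X^T X for the model y = b0 + b1 * x."""
    X = np.column_stack([np.ones_like(x), x])
    return np.linalg.det(X.T @ X)

rng = np.random.default_rng(0)
n, alpha = 100_000, 0.01          # illustrative sizes, not from the paper
x = rng.standard_normal(n)        # assumed covariate distribution

# Subsample chosen by the tail rule.
accept = tail_subsample(x, alpha)
k = int(accept.sum())

# Uniform random subsample of the same size, for comparison.
random_idx = rng.choice(n, size=k, replace=False)

det_tail = information_det(x[accept])
det_random = information_det(x[random_idx])
print(f"subsample size: {k}")
print(f"det(X^T X), tail rule:      {det_tail:.3e}")
print(f"det(X^T X), uniform random: {det_random:.3e}")
```

Under these assumptions the tail rule gives a larger determinant than uniform random subsampling, which is the sense in which a design can be better under the D-criterion. The rule itself needs only one threshold comparison per data point.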