We propose a novel two-stage subsampling algorithm based on optimal design principles. In the first stage, we use a density-based clustering algorithm to identify an approximating design space for the predictors from an initial subsample. Next, we determine an optimal approximate design on this design space. Finally, we use matrix distances such as the Procrustes, Frobenius, and square-root distances to define the remaining subsample, such that its points are "closest" to the support points of the optimal design. Our approach reflects the specific nature of the information matrix as a weighted sum of non-negative definite Fisher information matrices evaluated at the design points, and it applies to a large class of regression models, including models where the Fisher information is of rank greater than $1$.
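
To make the last step more concrete, the following is a minimal sketch, not the implementation of the proposed algorithm. It assumes a straight-line regression model with regression vector $f(x) = (1, x)^\top$, so that the Fisher information at a point $x$ is the rank-1 matrix $f(x) f(x)^\top$, and it restricts all matrix computations to the $2 \times 2$ case, where closed forms are available. The function names (`info_matrix`, `select_closest`, and so on) and the rule of allocating `round(w * n)` points to a support point with weight `w` are illustrative assumptions. The Procrustes distance is taken here in its size-and-shape form, $d(A,B)^2 = \operatorname{tr} A + \operatorname{tr} B - 2 \operatorname{tr}\big((A^{1/2} B A^{1/2})^{1/2}\big)$, which is also an assumption about the intended definition.

```python
import math

def info_matrix(x):
    """Rank-1 Fisher information f(x) f(x)^T for the assumed model f(x) = (1, x)."""
    f = (1.0, x)
    return [[f[i] * f[j] for j in range(2)] for i in range(2)]

def trace(a):
    return a[0][0] + a[1][1]

def det(a):
    return a[0][0] * a[1][1] - a[0][1] * a[1][0]

def sqrtm_2x2(a):
    """Principal square root of a symmetric non-negative definite 2x2 matrix.

    Uses the closed form (A + s I) / t with s = sqrt(det A), t = sqrt(tr A + 2 s).
    """
    s = math.sqrt(max(det(a), 0.0))
    t = math.sqrt(max(trace(a) + 2.0 * s, 0.0))
    if t == 0.0:  # zero matrix
        return [[0.0, 0.0], [0.0, 0.0]]
    return [[(a[i][j] + (s if i == j else 0.0)) / t for j in range(2)]
            for i in range(2)]

def frobenius(a, b):
    """Frobenius distance ||A - B||_F."""
    return math.sqrt(sum((a[i][j] - b[i][j]) ** 2
                         for i in range(2) for j in range(2)))

def square_root_distance(a, b):
    """Frobenius distance between the matrix square roots."""
    return frobenius(sqrtm_2x2(a), sqrtm_2x2(b))

def procrustes(a, b):
    """Size-and-shape Procrustes distance between 2x2 non-negative definite matrices.

    For 2x2 matrices, tr((A^{1/2} B A^{1/2})^{1/2}) equals
    sqrt(tr(AB) + 2 sqrt(det A det B)).
    """
    tr_ab = sum(a[i][j] * b[j][i] for i in range(2) for j in range(2))
    cross = math.sqrt(max(tr_ab + 2.0 * math.sqrt(max(det(a) * det(b), 0.0)), 0.0))
    return math.sqrt(max(trace(a) + trace(b) - 2.0 * cross, 0.0))

def select_closest(candidates, support, weights, n, dist):
    """Pick about n candidates whose information matrices are closest to those
    of the support points; round(w * n) points per support point (assumed rule)."""
    remaining = list(candidates)
    chosen = []
    for x_star, w in zip(support, weights):
        k = round(w * n)
        target = info_matrix(x_star)
        remaining.sort(key=lambda x: dist(info_matrix(x), target))
        chosen.extend(remaining[:k])
        remaining = remaining[k:]  # each data point is used at most once
    return chosen

# Toy data; the D-optimal design for straight-line regression on [-1, 1]
# puts weight 1/2 on each of the points -1 and 1.
candidates = [-0.9, -0.5, 0.0, 0.4, 0.95, 0.8]
subsample = select_closest(candidates, [-1.0, 1.0], [0.5, 0.5], 2, frobenius)
print(subsample)
```

Passing `square_root_distance` or `procrustes` as `dist` changes only the notion of closeness. Matrices of larger dimension, as arise for models with more parameters, would require a general matrix square root instead of the $2 \times 2$ closed form used in this sketch.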