Subdata selection is the study of methods for selecting a small, representative sample from big data, the analysis of which is fast and statistically efficient. Existing subdata selection methods assume that the big data can be reasonably described by an underlying model, such as a (multinomial) logistic regression for classification problems. These methods work extremely well when the modeling assumption is correct but often yield poor results otherwise. In this paper, we propose a model-free subdata selection method for classification problems, and we call the resulting subdata PED subdata. The PED method uses decision trees to find a partition of the data and then selects an appropriate sample from each component of the partition. The selected subdata is then analyzed using random forests. Our method can be applied to responses with any number of classes and to both categorical and continuous predictors. We show analytically that the PED subdata results in a smaller Gini than a uniform subdata. Further, we demonstrate through extensive experiments on simulated and real datasets that the PED subdata achieves higher classification accuracy than competing methods.
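The partition-then-sample idea described in the abstract can be illustrated with a minimal sketch. This is not the PED algorithm itself: the abstract does not specify how the tree is grown or how many points are taken from each component, so the greedy Gini-based splitting, the `max_depth` limit, and the equal-allocation sampling rule below are all assumptions made for illustration, and the function names (`gini`, `best_split`, `tree_partition`, `select_subdata`) are hypothetical.

```python
import random
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0.0 for an empty or pure list)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_split(idx, X, y):
    """Return the (feature, threshold) that most reduces weighted Gini, or None."""
    best, best_score = None, gini([y[i] for i in idx])
    for j in range(len(X[0])):
        values = sorted({X[i][j] for i in idx})
        for t in values[:-1]:  # the largest value would leave the right side empty
            left = [y[i] for i in idx if X[i][j] <= t]
            right = [y[i] for i in idx if X[i][j] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(idx)
            if score < best_score - 1e-12:
                best, best_score = (j, t), score
    return best


def tree_partition(idx, X, y, max_depth):
    """Recursively split the rows in idx; return the leaves as lists of row indices."""
    split = best_split(idx, X, y) if max_depth > 0 else None
    if split is None:
        return [idx]
    j, t = split
    left = [i for i in idx if X[i][j] <= t]
    right = [i for i in idx if X[i][j] > t]
    return (tree_partition(left, X, y, max_depth - 1)
            + tree_partition(right, X, y, max_depth - 1))


def select_subdata(X, y, k, max_depth=3, seed=0):
    """Pick roughly k row indices by sampling equally from each leaf.

    Equal allocation is a placeholder assumption, not the rule from the paper.
    """
    rng = random.Random(seed)
    leaves = tree_partition(list(range(len(X))), X, y, max_depth)
    per_leaf = max(1, k // len(leaves))
    chosen = []
    for leaf in leaves:
        chosen.extend(rng.sample(leaf, min(per_leaf, len(leaf))))
    return sorted(chosen), leaves


# Toy data: two continuous predictors and a three-class response.
rng = random.Random(1)
X = [[rng.random(), rng.random()] for _ in range(300)]
y = [int(x[0] > 0.5) + int(x[1] > 0.7) for x in X]
subdata, leaves = select_subdata(X, y, k=40)
print(len(leaves), "leaves;", len(subdata), "rows selected")
```

The selected row indices would then be passed to a classifier for analysis. With scikit-learn, for instance, one could plausibly obtain the partition from `DecisionTreeClassifier.apply` and fit a `RandomForestClassifier` on the selected rows, though that pairing is likewise an assumption rather than the authors' implementation.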