Massive datasets are increasingly central to modern data analysis, but analyzing them often demands considerable computational time and memory. Many existing works offer optimal subsampling methods that allow analysis on a subsample with minimal efficiency loss, yet they lack tools for choosing the subsample size. To bridge this gap, we introduce tools for selecting the optimal subsample size. We focus on three settings: the Cox regression model for survival data with rare events, and logistic regression for both balanced and imbalanced datasets. Additionally, we present a novel optimal subsampling procedure tailored to logistic regression with imbalanced data. The efficacy of these tools and procedures is demonstrated through an extensive simulation study and analyses of two large datasets.