Data collection costs can vary widely across variables in data science tasks. Two-phase designs are often used to reduce these costs. In a two-phase study, inexpensive variables are collected for all subjects in the first phase, and expensive variables are measured in the second phase for a subset of subjects selected by a predetermined sampling rule. Estimation efficiency under a two-phase design depends heavily on this sampling rule. The existing literature focuses primarily on designing sampling rules for estimating a scalar parameter in specific parametric models or for particular estimation problems. In practice, however, the underlying model is usually unknown, and two-phase designs are needed for model-free estimation of a scalar or multi-dimensional parameter. This paper proposes a maximin criterion, based on semiparametric efficiency bounds, for designing an optimal sampling rule. The proposed method is model-free and applies to general estimation problems. The resulting sampling rule can minimize the semiparametric efficiency bound when the parameter is scalar and improve the bound for every component when the parameter is multi-dimensional. Simulation studies show that the proposed designs reduce the variance of the resulting estimator in a variety of settings. A real data analysis illustrates how the proposed design is implemented.
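To make the two-phase setup concrete, the following is a minimal simulation sketch, not the maximin design proposed in the paper. It assumes a toy data-generating process (a binary inexpensive variable `x` and an expensive outcome `y` whose noise level depends on `x`), Bernoulli phase-two sampling, and an inverse-probability-weighted (Hajek) estimator of the mean of `y`. The function names, the parameter values, and the "sample proportionally to the conditional standard deviation" rule are all illustrative assumptions; that rule also uses the true noise levels, which would be unknown in practice. The sketch only shows that, at the same expected phase-two budget, the choice of sampling rule changes the variance of the estimator.

```python
import numpy as np


def two_phase_ipw_mean(y, pi, rng):
    """Draw a Bernoulli phase-two sample with inclusion probabilities `pi`
    and return the inverse-probability-weighted (Hajek) estimate of mean(y).

    Only the entries of `y` with a nonzero weight contribute, which mimics
    measuring the expensive variable on the phase-two subset alone.
    """
    selected = rng.random(y.shape[0]) < pi   # phase-two inclusion indicators
    weights = selected / pi                  # 1/pi if sampled, 0 otherwise
    return np.sum(weights * y) / np.sum(weights)


def simulate(n=2000, reps=1000, budget=0.2, seed=0):
    """Compare two sampling rules with the same expected phase-two fraction.

    Toy model (an assumption for illustration): x ~ Bernoulli(0.5) is cheap,
    y = 1 + sd(x) * noise is expensive, with sd = 4.0 if x == 1 else 0.5.
    """
    rng = np.random.default_rng(seed)
    est_uniform, est_tilted = [], []
    for _ in range(reps):
        # Phase one: the inexpensive variable is observed for everyone.
        x = rng.binomial(1, 0.5, size=n)
        sd = np.where(x == 1, 4.0, 0.5)
        y = 1.0 + sd * rng.normal(size=n)

        # Rule A: sample every subject with the same probability.
        pi_uniform = np.full(n, budget)
        # Rule B: oversample the noisier stratum; the average probability
        # stays equal to `budget`, so the expected cost is unchanged.
        pi_tilted = np.clip(budget * sd / sd.mean(), 0.0, 1.0)

        est_uniform.append(two_phase_ipw_mean(y, pi_uniform, rng))
        est_tilted.append(two_phase_ipw_mean(y, pi_tilted, rng))

    est_uniform = np.asarray(est_uniform)
    est_tilted = np.asarray(est_tilted)
    return {
        "mean_uniform": est_uniform.mean(),
        "mean_tilted": est_tilted.mean(),
        "var_uniform": est_uniform.var(ddof=1),
        "var_tilted": est_tilted.var(ddof=1),
    }


if __name__ == "__main__":
    for name, value in simulate().items():
        print(f"{name}: {value:.5f}")
```

Both rules target the same quantity (the true mean is 1.0 in this toy model), so any difference between `var_uniform` and `var_tilted` comes from the sampling rule alone.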