Data collection costs can vary widely across variables in data science tasks. Two-phase designs are often used to reduce these costs. In a two-phase study, inexpensive variables are collected for all subjects in the first phase, and expensive variables are measured in the second phase for a subset of subjects selected by a predetermined sampling rule. Estimation efficiency under a two-phase design depends heavily on the sampling rule. The existing literature focuses primarily on designing sampling rules for estimating a scalar parameter in certain parametric models or for specific estimation problems. In real-world applications, however, the underlying model is usually unknown, and a two-phase design is needed for model-free estimation of a scalar or multi-dimensional parameter. This paper proposes a maximin criterion, based on semiparametric efficiency bounds, for designing an optimal sampling rule. The proposed method is model-free and applicable to general estimation problems. The resulting sampling rule minimizes the semiparametric efficiency bound when the parameter is scalar and improves the bound for every component when the parameter is multi-dimensional. Simulation studies demonstrate that the proposed designs reduce the variance of the resulting estimator in various settings. A real data analysis illustrates the implementation of the proposed design.
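To make the role of the sampling rule concrete, the following is a minimal Python sketch of a two-phase study on simulated data. It is not the maximin design proposed in the paper. Everything in it is an assumption made for illustration: the data-generating model, the budget `RHO`, the function names, the inverse-probability-weighted estimator of a mean, and the `oracle_rule`, which uses a conditional second moment that would be unknown in practice.

```python
import numpy as np

RHO = 0.3  # assumed phase-two budget: average fraction of subjects sampled


def uniform_rule(x):
    """Sample every subject with the same probability RHO."""
    return np.full(x.shape, RHO)


def oracle_rule(x):
    """Toy rule: probability proportional to sqrt(E[Y^2 | X]).

    The conditional second moment is known here only because the data
    are simulated below; it is an illustrative assumption.
    """
    s = np.sqrt(x**2 + (0.5 + 2.0 * np.abs(x)) ** 2)
    return np.clip(RHO * s / s.mean(), 0.0, 1.0)


def ipw_mean(y, selected, pi):
    """Inverse-probability-weighted estimate of E[Y] from phase-two subjects only."""
    return np.sum(y[selected] / pi[selected]) / len(pi)


def simulate(rule, n=2000, reps=2000, seed=0):
    """Repeat the two-phase study and return one estimate per replication."""
    rng = np.random.default_rng(seed)
    estimates = np.empty(reps)
    for r in range(reps):
        # Phase one: the inexpensive variable x is observed for everyone.
        x = rng.normal(size=n)
        # Expensive variable with noise that grows with |x|; its true mean is 0.
        y = x + (0.5 + 2.0 * np.abs(x)) * rng.normal(size=n)
        # The sampling rule may depend on phase-one data only.
        pi = rule(x)
        # Phase two: independent Bernoulli selection with probability pi.
        selected = rng.random(n) < pi
        estimates[r] = ipw_mean(y, selected, pi)
    return estimates


if __name__ == "__main__":
    for name, rule in [("uniform", uniform_rule), ("oracle", oracle_rule)]:
        est = simulate(rule)
        print(f"{name:8s} mean={est.mean():+.4f} variance={est.var(ddof=1):.5f}")
```

Both rules measure roughly the same fraction of subjects in the second phase, yet in this toy setup the rule that oversamples subjects with noisier outcomes should give a less variable estimator. That is the sense in which efficiency depends on the sampling rule.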