Two-phase sampling is commonly adopted to reduce cost and improve estimation efficiency. In many two-phase studies, the outcome and some cheap covariates are observed for a large sample in Phase I, and expensive covariates are obtained for a selected subset of the sample in Phase II. As a result, the analysis of the association between the outcome and covariates faces a missing data problem. Complete-case analysis, which relies solely on the Phase II sample, is generally inefficient. In this paper, we study a two-step estimation approach, which first obtains an estimator using the complete data (the Phase II sample), and then updates it using an asymptotically mean-zero estimator obtained from a working model between the outcome and cheap covariates using the full data. This two-step estimator is asymptotically at least as efficient as the complete-data estimator and is robust to misspecification of the working model. We propose a kernel-based method to construct a two-step estimator that achieves optimal efficiency. Additionally, we develop a simple joint update approach based on multiple working models to approximate the optimal estimator when a fully nonparametric kernel approach is infeasible. We illustrate the proposed methods with various outcome models. We demonstrate their advantages over existing approaches through simulation studies and provide an application to a major cancer genomics study.
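To make the two-step idea concrete, the following is a minimal sketch under simplifying assumptions that are ours and not taken from the paper: a linear outcome model fitted by ordinary least squares, a Phase II sample drawn by simple random sampling, a single working model (outcome on the cheap covariate only), and covariances estimated from influence functions. The function names `ols`, `two_step_update`, and `simulate` are hypothetical. The sketch shows only the generic update (complete-case estimator minus a projection onto the difference between the Phase II and full-data fits of the working model); it is not the kernel-based optimal estimator or the joint update with multiple working models.

```python
import numpy as np

def ols(X, y):
    """OLS coefficients and per-observation influence functions (n x p)."""
    beta = np.linalg.solve(X.T @ X, X.T @ y)
    resid = y - X @ beta
    A = X.T @ X / len(y)
    infl = np.linalg.solve(A, (X * resid[:, None]).T).T
    return beta, infl

def two_step_update(y, Z, X_exp, in_phase2):
    """Complete-case estimate and its two-step update.

    Assumes Phase II is a simple random subsample (our assumption).
    y, Z: observed for everyone; X_exp: only used on Phase II rows.
    """
    ones = np.ones(len(y))
    W = np.column_stack([ones, Z])             # working model: y ~ 1 + Z
    D = np.column_stack([ones, Z, X_exp])      # target model:  y ~ 1 + Z + X

    # Step 1: complete-case fit of the target model on Phase II only.
    beta_cc, phi = ols(D[in_phase2], y[in_phase2])

    # Working model fitted twice: on Phase II and on the full sample.
    gamma_cc, h = ols(W[in_phase2], y[in_phase2])
    gamma_full, _ = ols(W, y)

    # Step 2: subtract the projection of beta_cc onto the asymptotically
    # mean-zero difference gamma_cc - gamma_full. Under simple random
    # sampling the sampling-fraction factors cancel in C V^{-1}.
    n2 = in_phase2.sum()
    C = phi.T @ h / n2                         # cross-moment of influences
    V = h.T @ h / n2                           # second moment of working influences
    beta_upd = beta_cc - C @ np.linalg.solve(V, gamma_cc - gamma_full)
    return beta_cc, beta_upd

def simulate(rng, N=2000, n2=400):
    """Toy two-phase data (hypothetical design, true coefficients all 1)."""
    Z = rng.normal(size=N)                     # cheap covariate
    X = rng.normal(size=N)                     # expensive covariate
    y = 1.0 + 1.0 * Z + 1.0 * X + rng.normal(size=N)
    in_phase2 = np.zeros(N, dtype=bool)
    in_phase2[rng.choice(N, size=n2, replace=False)] = True
    return y, Z, X, in_phase2

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    y, Z, X, s = simulate(rng)
    beta_cc, beta_upd = two_step_update(y, Z, X, s)
    print("complete-case:", beta_cc)
    print("two-step     :", beta_upd)
```

In this toy design the gain appears mainly in the intercept and the coefficient of the cheap covariate, since those are the components the working model carries information about. Outcome-dependent Phase II sampling would require weighting or other adjustments that this sketch omits.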