Two-phase validation sampling via principal components to improve efficiency in multi-model estimation from error-prone biomedical databases

Two-phase sampling offers a cost-effective way to validate error-prone covariate measurements in biomedical databases. In Phase I, inexpensive or easy-to-obtain information is collected for the entire study sample. In Phase II, a subset of patients undergoes cost-intensive validation (e.g., expert chart review) to collect more accurate data. When primary and secondary analyses must be balanced, competing models and priorities can leave the objective poorly defined, making it unclear which Phase II sampling criterion is most informative. Extreme tail sampling (ETS), wherein patients with the smallest and largest values of a particular quantity (like a covariate or residual) are selected, can be highly efficient in two-phase studies focused on a single analytic objective because it targets the observations that contribute most to the Fisher information. We propose an intuitive, easy-to-use approach that extends ETS to balance multiple models of interest by prioritizing the largest amount of variability across them. Using principal components analysis, we succinctly summarize the inherent variability of all models' error-prone exposures. Then, we sample patients with the most extreme values of the first principal component for validation. In extensive simulations and an application to the National Health and Nutrition Examination Survey (NHANES), the proposed strategy offered simultaneous efficiency gains across multiple models of interest. Its advantages persisted across various real-world scenarios, including correlated or heterogeneous measurement error. When designing a validation study, concentrating on a single model may be short-sighted. Strategically allocating resources more broadly balances multiple analytical goals simultaneously. Employing dimension reduction before sampling will allow this strategy to scale up well to big-data applications with many error-prone exposures.
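
As a rough illustration of the sampling rule described above, the following Python sketch computes the first principal component of the Phase I error-prone exposures and selects the patients in its two tails for Phase II validation. It is a minimal sketch under stated assumptions, not the authors' implementation: the function name `pc_extreme_tail_sample`, the choice to standardize the exposures before the decomposition, the even split between the lower and upper tails, and the simulated data are all hypothetical choices made for this example.

```python
import numpy as np


def pc_extreme_tail_sample(x_star, n_validate):
    """Pick Phase II patients from the tails of the first principal component.

    x_star     : (n, p) array of error-prone exposures observed in Phase I.
    n_validate : number of patients to validate in Phase II.
    Returns (sorted row indices of selected patients, first-PC scores).
    """
    x_star = np.asarray(x_star, dtype=float)
    n = x_star.shape[0]
    if not 0 < n_validate <= n:
        raise ValueError("n_validate must be between 1 and the number of rows")

    # Assumption: standardize each exposure so that no single variable
    # dominates the first component because of its measurement scale.
    # (Requires every column to have non-zero variance.)
    z = (x_star - x_star.mean(axis=0)) / x_star.std(axis=0, ddof=1)

    # PCA via singular value decomposition: rows of vt are the loadings,
    # ordered by decreasing explained variance.
    _, _, vt = np.linalg.svd(z, full_matrices=False)
    pc1 = z @ vt[0]

    # Assumption: split the validation budget evenly between the two tails.
    order = np.argsort(pc1, kind="stable")
    n_low = n_validate // 2
    n_high = n_validate - n_low
    chosen = np.concatenate([order[:n_low], order[n - n_high:]])
    return np.sort(chosen), pc1


# Simulated Phase I data (hypothetical): three correlated true exposures
# observed with additive measurement error.
rng = np.random.default_rng(0)
n = 1000
cov = np.array([[1.0, 0.6, 0.3],
                [0.6, 1.0, 0.5],
                [0.3, 0.5, 1.0]])
x_true = rng.multivariate_normal(np.zeros(3), cov, size=n)
x_star = x_true + rng.normal(scale=0.5, size=(n, 3))

idx, pc1 = pc_extreme_tail_sample(x_star, n_validate=100)
print(f"Selected {idx.size} of {n} patients for Phase II validation")
```

The sign of a principal component is arbitrary, but because the rule takes both tails, flipping the sign leaves the selected set unchanged whenever the budget is split evenly.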
