Two-phase validation sampling via principal components to improve efficiency in multi-model estimation from error-prone biomedical databases

Two-phase sampling offers a cost-effective way to validate error-prone covariate measurements in biomedical databases. In Phase I, inexpensive or easy-to-obtain information is collected for the entire study sample. In Phase II, a subset of patients undergoes cost-intensive validation (e.g., expert chart review) to collect more accurate data. When primary and secondary analyses must be balanced, competing models and priorities can make it unclear which Phase II sampling criterion is most informative. Extreme tail sampling (ETS), wherein patients with the smallest and largest values of a particular quantity (like a covariate or residual) are selected, can offer great statistical efficiency in two-phase studies with a single analytic objective because it targets the observations that contribute most to the Fisher information. We propose an intuitive, easy-to-use approach that extends ETS to multiple models of interest by prioritizing the explanation of the largest amount of variability across them. Using principal components, we succinctly summarize the inherent variability of all models' error-prone exposures. Then, we sample patients with the most extreme principal component scores for validation. In simulations and an application to the National Health and Nutrition Examination Survey (NHANES), the proposed strategy offered simultaneous efficiency gains across multiple models of interest, and its advantages persisted across various real-world scenarios. When designing a validation study, concentrating on a single model may be short-sighted; strategically allocating resources more broadly balances multiple analytical goals simultaneously. Employing dimension reduction before sampling will allow this strategy to scale up well to big-data applications with many error-prone covariates.
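The sampling idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that only the first principal component of the standardized error-prone exposures is used and that the validation budget is split evenly between the two tails. The function name `pc_extreme_tail_sample`, the simulated data, and all parameter values are hypothetical.

```python
import numpy as np

def pc_extreme_tail_sample(X_star, n_validate):
    """Pick Phase II patients from both tails of the first principal component.

    X_star: (n, p) array of error-prone exposures from Phase I
            (assumes no column is constant).
    n_validate: number of patients to validate, between 1 and n.
    Returns (selected row indices, first-PC scores for all n patients).
    """
    X = np.asarray(X_star, dtype=float)
    n = X.shape[0]
    if not 1 <= n_validate <= n:
        raise ValueError("n_validate must be between 1 and the number of rows")
    # Standardize each exposure so that no single scale dominates the PCA.
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    # The right singular vectors of the centered data are the PC loadings.
    _, _, Vt = np.linalg.svd(Z, full_matrices=False)
    scores = Z @ Vt[0]  # first-PC score per patient (sign is arbitrary)
    order = np.argsort(scores)
    n_low = n_validate // 2
    n_high = n_validate - n_low
    selected = np.concatenate([order[:n_low], order[-n_high:]])
    return selected, scores

# Simulated Phase I data: three correlated error-prone exposures, 1000 patients.
rng = np.random.default_rng(0)
latent = rng.normal(size=(1000, 1))
loadings = np.array([[1.0, 0.8, 0.5]])
X_star = latent @ loadings + rng.normal(scale=0.5, size=(1000, 3))

chosen, scores = pc_extreme_tail_sample(X_star, n_validate=100)
print(len(chosen), "patients selected for Phase II validation")
```

Because the sign of a principal component is arbitrary, an even split between the tails keeps the selected set unchanged if the component is flipped. Using several components, or weighting models by priority, would need an additional rule that this sketch does not attempt.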
