Combining an internal individual-level study with readily available external summary statistics promises major efficiency gains at minimal additional cost, yet heterogeneity between sources can bias estimates for the internal target population. We develop a generalized entropy-balancing integration strategy that calibrates external moments to the internal covariate distribution, explicitly permitting a biased external sample. Our estimator of the internal-population mean is doubly robust: it remains consistent when either the outcome-regression model or the entropy-balancing model is correctly specified. When multiple balancing specifications are plausible, we introduce a data-adaptive selection rule. We also provide easy-to-compute, fully estimable diagnostics, based on the Mahalanobis distance and the Pearson chi-square divergence, that pinpoint when integration is guaranteed to strictly outperform the internal sample mean. The approach is implemented in the R package daisy. Simulations and an application to nationwide public-access defibrillation records in Japan demonstrate meaningful precision gains while maintaining bias control under distributional shift.
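To make the calibration idea concrete, the following is a minimal sketch of generic entropy balancing in one dimension. It is not the daisy implementation and not the generalized estimator described above; it only illustrates the textbook building block, in which a sample is reweighted so that its weighted covariate mean matches a target moment. Among all weight vectors on the simplex that satisfy the moment constraint, weights of the exponential-tilting form w_i ∝ exp(λ x_i) minimize the Kullback-Leibler divergence from uniform weights. The function names (`tilted_weights`, `weighted_mean`, `entropy_balance_1d`), the toy data, and the use of bisection to solve for λ are assumptions made for illustration.

```python
import math


def tilted_weights(x, lam):
    """Exponential-tilting weights w_i proportional to exp(lam * x_i), normalised to sum to 1."""
    z = [lam * xi for xi in x]
    zmax = max(z)  # subtract the maximum before exponentiating to avoid overflow
    e = [math.exp(zi - zmax) for zi in z]
    total = sum(e)
    return [ei / total for ei in e]


def weighted_mean(w, x):
    """Weighted mean of x under weights w (w is assumed to sum to 1)."""
    return sum(wi * xi for wi, xi in zip(w, x))


def entropy_balance_1d(x, target_mean, tol=1e-10, max_iter=200):
    """Illustrative (hypothetical) solver: find weights whose weighted mean equals target_mean.

    The tilted mean is strictly increasing in lam, so bisection on lam
    converges whenever the target lies strictly inside the range of x.
    """
    if not (min(x) < target_mean < max(x)):
        raise ValueError("target_mean must lie strictly between min(x) and max(x)")

    # Expand the bracket [lo, hi] until the tilted means straddle the target.
    lo, hi = -1.0, 1.0
    while weighted_mean(tilted_weights(x, lo), x) > target_mean:
        lo *= 2.0
    while weighted_mean(tilted_weights(x, hi), x) < target_mean:
        hi *= 2.0

    mid = 0.5 * (lo + hi)
    for _ in range(max_iter):
        mid = 0.5 * (lo + hi)
        gap = weighted_mean(tilted_weights(x, mid), x) - target_mean
        if abs(gap) < tol:
            break
        if gap > 0:
            hi = mid
        else:
            lo = mid
    return tilted_weights(x, mid)


# Toy data (made up): reweight five points so the weighted mean moves from 3.0 to 3.5.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
w = entropy_balance_1d(x, target_mean=3.5)
print([round(wi, 4) for wi in w])
print(round(weighted_mean(w, x), 6))
```

With several balancing moments the scalar λ becomes a vector and the same dual problem is usually solved with Newton-type iterations; bisection is used here only because it is easy to verify in the scalar case.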