Integration of Individual Participant and Aggregate Data Under Dataset Shift: Summary Statistic Comparison and Scalable Computation

Integrated IPD-AD analysis, which combines individual participant data (IPD) with aggregate data (AD), is increasingly recognized as an effective strategy for generating more reliable and generalizable inferences from heterogeneous studies. While most existing work has focused on algorithmic approaches, this paper investigates a complementary yet underexplored question: how different forms of AD influence the efficiency of data integration. Working within a constrained maximum likelihood estimation framework, we compare commonly reported summary statistics and show that subgroup-specific summaries can substantially improve estimation efficiency. In particular, we find that AD derived from outcome-stratified subgroups (e.g., cases and controls) consistently yield greater efficiency gains than those based on covariate-stratified subgroups (e.g., age or exposure categories), especially when the outcome is continuous. Although outcome-stratified summaries are commonly reported for discrete outcomes, they are rarely provided when the outcome is continuous. Our findings therefore support the routine inclusion of outcome-stratified summaries for continuous endpoints in trial reports and public data repositories to facilitate more efficient evidence synthesis. We further extend the constrained maximum likelihood framework to accommodate dataset shift and develop a fast, non-iterative estimation procedure to improve numerical stability and scalability. We illustrate the proposed methodology with two applications: an analysis of income data under covariate shift and an analysis of housing data under prior probability shift.
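As a purely illustrative sketch of the constrained maximum likelihood idea, and not the estimator developed in the paper, consider a Gaussian linear model fitted to IPD while one aggregate summary is imposed as a constraint. The setup is assumed for illustration: the model is y = a + b*x + error, the only AD is an externally reported mean of the outcome (`mu_ext`, a hypothetical value), the IPD and AD populations share the same distribution (no dataset shift), and the covariate distribution is replaced by its empirical counterpart in the IPD. Under these assumptions the constrained fit has a closed form, so no iterative optimization is needed.

```python
import numpy as np

def constrained_ls(x, y, mu_ext):
    """Least squares for y = a + b*x subject to a + b*mean(x) = mu_ext.

    With Gaussian errors this coincides with the constrained maximum
    likelihood estimate of (a, b). Hypothetical helper for illustration.
    """
    xbar = x.mean()
    xc = x - xbar
    # Substituting a = mu_ext - b*xbar into the sum of squares and solving
    # the first-order condition gives the usual OLS slope, because the
    # centered covariate sums to zero.
    b = np.dot(xc, y) / np.dot(xc, xc)
    # The intercept is then pinned down by the aggregate-data constraint.
    a = mu_ext - b * xbar
    return a, b

# Simulated IPD (assumed data-generating values, not from the paper).
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 1.0 + 2.0 * x + rng.normal(size=200)

mu_ext = 1.0  # hypothetical externally reported mean of the outcome
a, b = constrained_ls(x, y, mu_ext)
```

In this toy case the aggregate summary only affects the intercept. Richer summaries, such as subgroup-specific means, would constrain more of the parameter vector, which is the kind of comparison the abstract describes.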
