A critical decision when training predictors on multiple studies is whether the studies should be combined or treated separately. We compare two multi-study prediction approaches in the presence of potential heterogeneity in predictor-outcome relationships across datasets: 1) merging all of the datasets and training a single learner, and 2) multi-study ensembling, which involves training a separate learner on each dataset and combining the resulting predictions. For ridge regression, we show analytically and confirm via simulation that merging yields lower prediction error than ensembling when the predictor-outcome relationships are relatively homogeneous across studies. However, as cross-study heterogeneity increases, there exists a transition point beyond which ensembling outperforms merging. We provide analytic expressions for the transition point in various scenarios, study asymptotic properties, and use an application from metagenomics to illustrate how transition point theory can guide the decision of when studies should be combined.
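The trade-off between merging and ensembling can be made concrete with a small numerical sketch. The setup below is an illustrative assumption, not the paper's exact formulation: study-specific coefficients are drawn as a shared vector `beta` plus independent random effects with variance `sigma2`, the ensemble uses equal weights, both approaches share one ridge penalty `lam`, and test predictors are taken to have identity second moment. Under these assumptions the expected squared prediction error of each approach is linear in `sigma2`, so the crossing point can be computed directly. The names `error_terms` and `tau` are hypothetical.

```python
import numpy as np

def error_terms(Xs, beta, lam, sigma_e2, mode):
    """Return (a, c) with expected squared error = a + c * sigma2,
    up to terms shared by both approaches (test noise and the test
    study's own random effect), which are omitted.

    Assumes y_k = X_k (beta + gamma_k) + eps_k, Var(gamma_k) = sigma2 * I,
    Var(eps_k) = sigma_e2 * I, and identity test-predictor second moment.
    """
    K, p = len(Xs), Xs[0].shape[1]
    I = np.eye(p)
    if mode == "merge":
        # One ridge fit on the stacked data: beta_hat = sum_k M_k y_k.
        G = np.linalg.inv(sum(X.T @ X for X in Xs) + lam * I)
        Ms = [G @ X.T for X in Xs]
    elif mode == "ensemble":
        # One ridge fit per study, combined with equal weights 1/K.
        Ms = [np.linalg.inv(X.T @ X + lam * I) @ X.T / K for X in Xs]
    else:
        raise ValueError("mode must be 'merge' or 'ensemble'")
    # Squared shrinkage bias: ||(sum_k M_k X_k - I) beta||^2.
    B = sum(M @ X for M, X in zip(Ms, Xs)) - I
    bias2 = float(np.sum((B @ beta) ** 2))
    # Variance from outcome noise: sigma_e2 * sum_k ||M_k||_F^2.
    noise = sigma_e2 * sum(float(np.sum(M ** 2)) for M in Ms)
    # Coefficient on sigma2: sum_k ||M_k X_k||_F^2.
    hetero = sum(float(np.sum((M @ X) ** 2)) for M, X in zip(Ms, Xs))
    return bias2 + noise, hetero

# Illustrative settings (all values are assumptions for this sketch).
rng = np.random.default_rng(0)
K, n, p, lam = 5, 30, 10, 1.0
Xs = [rng.standard_normal((n, p)) for _ in range(K)]
beta = np.ones(p)

a_m, c_m = error_terms(Xs, beta, lam, 1.0, "merge")
a_e, c_e = error_terms(Xs, beta, lam, 1.0, "ensemble")

# The two error lines cross where a_m + c_m * tau = a_e + c_e * tau.
tau = (a_e - a_m) / (c_m - c_e)
print(f"merge:    {a_m:.4f} + {c_m:.4f} * sigma2")
print(f"ensemble: {a_e:.4f} + {c_e:.4f} * sigma2")
print(f"transition at sigma2 = {tau:.4f}")
```

In this sketch, merging has the smaller intercept (less noise variance and shrinkage bias) while ensembling has the smaller slope in `sigma2`, so merging is preferable below `tau` and ensembling above it. The numerical value of `tau` depends on the simulated design matrices and on the assumed settings.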