Improving prediction models by incorporating external data with weights based on similarity

In clinical settings, prediction models often have to be built from small observational data sets, for example the data from a single medical center in a multi-center study. Differences between centers can be large, which calls for a model tailored to the data set from the target center. Still, we want to borrow information from the external centers to compensate for the small sample size. Existing approaches assign weights either to each external data set or to each external observation. To incorporate information on differences at both levels, we propose an approach that combines the two into weights that can be incorporated into a likelihood for fitting regression models. Specifically, we suggest weights at the data set level that reflect how well the models providing the observation weights distinguish between data sets. Technically, this takes the form of inverse probability weighting. We explore different scenarios in which covariates and outcomes differ among data sets, and these scenarios inform the simulation design used to evaluate the method. The concept of effective sample size is used to understand the effectiveness of our subgroup modeling approach. We demonstrate the approach in a clinical application, predicting applied radiotherapy doses for cancer patients. Generally, the proposed approach improves prediction performance when external data sets are similar to the target data set. We thus provide a method for quantifying the similarity of external data sets to the target data set and for using this similarity to include external observations, improving performance in a prediction modeling task on a small target data set.
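
To make the weighting idea concrete, the following is a minimal sketch in Python (NumPy only), not the authors' implementation. It rests on several assumptions that the abstract does not specify: a logistic membership model (target versus one external data set) supplies the observation weights as inverse-probability-style odds, the data set weight is taken as `1 - |2 * AUC - 1|` of that membership model (close to 1 when the model cannot tell the data sets apart, close to 0 when it separates them well), the outcome model is a weighted Gaussian likelihood (weighted least squares), and the effective sample size is Kish's formula. All function names and the simulated data are hypothetical.

```python
import numpy as np

def _design(X):
    # Prepend an intercept column.
    return np.column_stack([np.ones(len(X)), X])

def fit_logistic(X, y, n_iter=25, ridge=1e-3):
    # Logistic regression via Newton's method with a small ridge penalty.
    Xd = _design(X)
    beta = np.zeros(Xd.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-np.clip(Xd @ beta, -30, 30)))
        grad = Xd.T @ (y - p) - ridge * beta
        hess = (Xd * (p * (1 - p))[:, None]).T @ Xd + ridge * np.eye(len(beta))
        beta = beta + np.linalg.solve(hess, grad)
    return beta

def predict_proba(X, beta):
    return 1.0 / (1.0 + np.exp(-np.clip(_design(X) @ beta, -30, 30)))

def auc(scores, labels):
    # Probability that a target observation scores higher than an external one.
    pos, neg = scores[labels == 1], scores[labels == 0]
    diff = pos[:, None] - neg[None, :]
    return float(np.mean(diff > 0) + 0.5 * np.mean(diff == 0))

def external_weights(X_target, X_ext):
    # Membership model: 1 = target data set, 0 = external data set.
    X = np.vstack([X_target, X_ext])
    is_target = np.r_[np.ones(len(X_target)), np.zeros(len(X_ext))]
    beta = fit_logistic(X, is_target)
    p_all = predict_proba(X, beta)
    p_ext = p_all[len(X_target):]
    # Observation weights: odds of target membership, rescaled for group sizes.
    obs_w = p_ext / (1 - p_ext) * len(X_ext) / len(X_target)
    # Data set weight (assumed form): high when the data sets are hard to distinguish.
    data_w = 1.0 - abs(2.0 * auc(p_all, is_target) - 1.0)
    return data_w * obs_w, data_w

def weighted_least_squares(X, y, w):
    # Maximizes a weighted Gaussian log-likelihood.
    Xd = _design(X)
    XtW = Xd.T * w
    return np.linalg.solve(XtW @ Xd, XtW @ y)

def effective_sample_size(w):
    # Kish's effective sample size.
    return float(np.sum(w) ** 2 / np.sum(w ** 2))

# Simulated example: one similar and one covariate-shifted external data set.
rng = np.random.default_rng(0)

def simulate(n, shift):
    X = rng.normal(size=(n, 2))
    X[:, 0] += shift
    y = 1.0 + X @ np.array([2.0, -1.0]) + rng.normal(size=n)
    return X, y

X_t, y_t = simulate(40, 0.0)        # small target data set
X_sim, y_sim = simulate(200, 0.0)   # similar external data set
X_dis, y_dis = simulate(200, 1.5)   # dissimilar external data set

w_sim, dw_sim = external_weights(X_t, X_sim)
w_dis, dw_dis = external_weights(X_t, X_dis)

X_all = np.vstack([X_t, X_sim, X_dis])
y_all = np.concatenate([y_t, y_sim, y_dis])
w_all = np.concatenate([np.ones(len(X_t)), w_sim, w_dis])  # target weight = 1

coef = weighted_least_squares(X_all, y_all, w_all)
ess = effective_sample_size(w_all)
print("data set weights (similar, dissimilar):", dw_sim, dw_dis)
print("coefficients:", coef)
print("effective sample size:", ess)
```

In this sketch only the covariate distribution differs between data sets; scenarios with differing outcome models would require the membership model to see the outcome as well, which is a design choice left open here.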
