A fundamental aspect of statistics is the integration of data from different sources. Classically, Fisher and others focused on how to integrate homogeneous (or only mildly heterogeneous) data sets. More recently, as data have become more accessible, the question of whether data sets from different sources should be integrated has become more relevant. The current literature treats this as a question with only two answers: integrate or don't. Here we take a different approach, motivated by information-sharing principles from the shrinkage estimation literature. In particular, we depart from the do/don't perspective and propose a dial parameter that controls the extent to which two data sources are integrated. How far this dial should be turned is shown to depend, for example, on the informativeness of the different data sources as measured by Fisher information. In the context of generalized linear models, this more nuanced data integration framework leads to relatively simple parameter estimates and valid tests/confidence intervals. Moreover, we demonstrate both theoretically and empirically that setting the dial parameter according to our recommendation leads to more efficient estimation than binary data integration schemes.