Information from multiple data sources is increasingly available. However, some data sources may yield biased estimates because of commonly encountered problems such as biased sampling, population heterogeneity, or model misspecification. This calls for statistical methods that can combine information in the presence of biased sources. In this paper, a robust data fusion-extraction method is proposed. The method produces a consistent estimator of the parameter of interest even if many of the data sources are biased. The proposed estimator is easy to compute and uses only summary statistics, and hence can be applied in many different fields, e.g. meta-analysis, Mendelian randomization, and distributed systems. Moreover, under some mild conditions, the proposed estimator is asymptotically equivalent to the oracle estimator that uses only data from the unbiased sources. Asymptotic normality of the proposed estimator is also established. In contrast to existing meta-analysis methods, the theoretical properties are guaranteed even if both the number of data sources and the dimension of the parameter diverge as the sample size increases, so the guarantees hold across a wide range of settings. The robustness and the oracle property are also evaluated via simulation studies. The proposed method is applied to a meta-analysis data set to evaluate surgical treatment for moderate periodontal disease, and to a Mendelian randomization data set to study the risk factors for head and neck cancer.