Suppose we have individual-level data from an internal study and various types of summary statistics from relevant external studies. External summary statistics have been used as constraints on the internal data distribution, which promises to improve statistical inference for the internal data. However, the additional use of external summary data may lead to paradoxical results: efficiency loss may occur if the uncertainty of the summary statistics is not negligible, and large estimation bias can emerge even if the bias of the external summary statistics is small. We investigate these paradoxical results in a semiparametric framework. We establish the semiparametric efficiency bound for estimating a general functional of the internal data distribution and show that it is no larger than the bound obtained using only internal data. We propose a data-fused efficient estimator that achieves this bound, which resolves the efficiency paradox. We further propose a debiased estimator that uses an adaptive lasso penalty to achieve selection consistency, so that the resulting estimator has the same asymptotic distribution as the oracle estimator that uses only unbiased summary statistics; this resolves the bias paradox. Simulations and an application to a Helicobacter pylori infection dataset illustrate the proposed methods.
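The efficiency paradox described above can be seen in a toy scalar setting. The sketch below is not the paper's estimator; it is a minimal illustration under assumptions introduced here: the target is the internal mean of Y, the external study reports an unbiased estimate of the mean of a covariate X with variance `v_ext`, the two studies are independent, and the combined estimator has the control-variate form `Ybar - beta * (Xbar - mu_ext)`. All function and parameter names are hypothetical.

```python
def internal_only_variance(var_y, n):
    """Variance of the internal sample mean Ybar."""
    return var_y / n


def combined_variance(var_y, var_x, cov_xy, n, v_ext, account_for_uncertainty):
    """Variance of Ybar - beta * (Xbar - mu_ext) for two choices of beta.

    Assumes (for illustration only) that mu_ext is unbiased and
    independent of the internal sample.
    """
    var_diff = var_x / n + v_ext  # Var(Xbar - mu_ext)
    cov_term = cov_xy / n         # Cov(Ybar, Xbar - mu_ext)
    if account_for_uncertainty:
        # Weight that minimises the variance given the external uncertainty.
        beta = cov_term / var_diff
    else:
        # Weight that would be optimal if mu_ext were known exactly.
        beta = cov_xy / var_x
    return var_y / n - 2 * beta * cov_term + beta ** 2 * var_diff


# Illustrative numbers: internal n = 100, and an external estimate whose
# variance is five times that of the internal Xbar.
settings = dict(var_y=1.0, var_x=1.0, cov_xy=0.8, n=100, v_ext=0.05)
v_internal = internal_only_variance(settings["var_y"], settings["n"])
v_naive = combined_variance(**settings, account_for_uncertainty=False)
v_weighted = combined_variance(**settings, account_for_uncertainty=True)
print(v_internal, v_naive, v_weighted)
```

With these numbers, treating the external summary as exact gives a larger variance than using the internal data alone, whereas weighting by the external uncertainty never does worse than the internal-only estimator. This mirrors, in a much simpler setting, the claim that the efficiency bound with external data is no larger than the internal-only bound.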