Suppose we have individual-level data from an internal study and various types of summary statistics from relevant external studies. External summary statistics have been used as constraints on the internal data distribution, which promises to improve statistical inference for the internal data. However, the additional use of external summary data may lead to paradoxical results: efficiency loss may occur if the uncertainty of the summary statistics is not negligible, and large estimation bias can emerge even if the bias of the external summary statistics is small. We investigate these paradoxical results in a semiparametric framework. We establish the semiparametric efficiency bound for estimating a general functional of the internal data distribution and show that it is no larger than the bound obtained using only internal data. We propose a data-fused efficient estimator that achieves this bound, which resolves the efficiency paradox. In addition, we propose a debiased estimator that uses an adaptive lasso penalty to achieve selection consistency, so that the resulting estimator has the same asymptotic distribution as the oracle estimator that uses only unbiased summary statistics; this resolves the bias paradox. Simulations and an application to a Helicobacter pylori infection dataset illustrate the proposed methods.
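The efficiency paradox can be made concrete with a toy calculation. The sketch below is not the semiparametric estimator proposed here; it is a simple control-variate illustration under assumptions introduced only for this example: the target is the internal mean of Y, the external study reports an unbiased estimate of the mean of a covariate X from m independent observations, and the variances and covariance of (X, Y) are treated as known. The function names `variance_with_weight`, `asymptotic_variances`, and `simulate_variances` are hypothetical.

```python
import numpy as np


def variance_with_weight(c, sigma_y2, sigma_x2, cov_xy, n, m):
    """Variance of ybar - c * (xbar - mu_ext).

    xbar, ybar are internal sample means (size n); mu_ext is an independent
    external estimate of E[X] with variance sigma_x2 / m (an assumption).
    """
    return sigma_y2 / n - 2 * c * cov_xy / n + c**2 * sigma_x2 * (1 / n + 1 / m)


def asymptotic_variances(sigma_y2, sigma_x2, cov_xy, n, m):
    """Return (internal_only, naive, fused) variances for estimating E[Y]."""
    internal_only = sigma_y2 / n
    # Naive weight: treats the external summary as if it had no uncertainty.
    beta = cov_xy / sigma_x2
    naive = variance_with_weight(beta, sigma_y2, sigma_x2, cov_xy, n, m)
    # Variance-minimizing weight: accounts for the external uncertainty.
    c_opt = (cov_xy / n) / (sigma_x2 * (1 / n + 1 / m))
    fused = variance_with_weight(c_opt, sigma_y2, sigma_x2, cov_xy, n, m)
    return internal_only, naive, fused


def simulate_variances(n, m, rho, reps=20000, seed=0):
    """Monte Carlo check with unit variances and correlation rho.

    Sample means are drawn directly from their exact normal distributions,
    which is valid here because (X, Y) is assumed bivariate normal.
    """
    rng = np.random.default_rng(seed)
    cov = np.array([[1.0, rho], [rho, 1.0]]) / n
    means = rng.multivariate_normal([0.0, 0.0], cov, size=reps)
    xbar, ybar = means[:, 0], means[:, 1]
    mu_ext = rng.normal(0.0, np.sqrt(1.0 / m), size=reps)
    naive = ybar - rho * (xbar - mu_ext)
    fused = ybar - rho * m / (n + m) * (xbar - mu_ext)
    return ybar.var(), naive.var(), fused.var()
```

With a small external sample relative to the internal one (for instance n = 1000 and m = 100), the naive weight yields a larger variance than using internal data alone, whereas the uncertainty-aware weight never does worse than the internal-only estimator in this toy setting.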