In many modern applications, a carefully designed primary study provides individual-level data for interpretable modeling, while summary-level external information is available through black-box, efficient, and nonparametric machine-learning predictions. Although summary-level external information has been studied in the data integration literature, there is limited methodology for leveraging external nonparametric machine-learning predictions to improve statistical inference in the primary study. We propose a general empirical-likelihood framework that incorporates external predictions through moment constraints. An advantage of nonparametric machine-learning prediction is that it induces a rich class of valid moment restrictions that remain robust to covariate shift under a mild overlap condition, without requiring explicit density-ratio modeling. We focus on multinomial logistic regression as the primary model and address common data-quality issues in external sources, including coarsened outcomes, partially observed covariates, covariate shift, and heterogeneity in the data-generating mechanisms, known as concept shift. We establish large-sample properties of the resulting fused estimator, including consistency and asymptotic normality under regularity conditions. Moreover, we provide mild sufficient conditions under which incorporating external predictions delivers a strict efficiency gain relative to the primary-only estimator. Simulation studies and an application to the National Health and Nutrition Examination Survey on multiclass blood-pressure classification validate the proposed method.
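To make the idea of incorporating external predictions through moment constraints concrete, the following is a minimal sketch of one standard way to combine a parametric likelihood with empirical-likelihood weights. It is an illustration under stated assumptions, not necessarily the exact construction used in the paper. The symbols are hypothetical: $(X_i, Y_i)$, $i = 1, \dots, n$, are the primary data; $\pi_k(x; \beta)$ is the multinomial logistic probability of class $k$; $\hat m_k(x)$ is the external machine-learning prediction, assumed to approximate $P(Y = k \mid X = x)$; and $h(x)$ is a user-chosen vector of functions of the covariates.

```latex
% Assumed moment function: external prediction minus primary-model probability,
% weighted by an arbitrary user-chosen function h of the covariates.
g_k(x; \beta) = h(x)\,\bigl\{ \hat m_k(x) - \pi_k(x; \beta) \bigr\},
\qquad k = 1, \dots, K,
\qquad g = (g_1^\top, \dots, g_K^\top)^\top .

% Fused objective: parametric log-likelihood plus empirical log-likelihood,
% maximized jointly over \beta and the weights p_1, \dots, p_n.
\max_{\beta,\, p_1, \dots, p_n} \;
  \sum_{i=1}^{n} \log \pi_{Y_i}(X_i; \beta) + \sum_{i=1}^{n} \log p_i
\quad \text{subject to} \quad
  p_i \ge 0, \quad
  \sum_{i=1}^{n} p_i = 1, \quad
  \sum_{i=1}^{n} p_i \, g(X_i; \beta) = 0 .

% Profiling out the weights with a Lagrange multiplier \lambda gives
p_i = \frac{1}{n \,\bigl\{ 1 + \lambda^\top g(X_i; \beta) \bigr\}},
\qquad \text{where } \lambda \text{ solves } \quad
\sum_{i=1}^{n} \frac{g(X_i; \beta)}{1 + \lambda^\top g(X_i; \beta)} = 0 .
```

In this sketch the constraint is a conditional statement given $X$. If $\hat m_k(x)$ matches the primary-population conditional probability, the moment has mean zero for any choice of $h$ and any covariate distribution with adequate overlap. This is one way to read the claim that such restrictions are robust to covariate shift without density-ratio modeling.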