In high-dimensional semi-supervised linear regression, prediction-powered inference (PPI) corrects an external predictor with a rectifier estimated from the labeled data. In a linear model, however, this rectifier cancels the predictor: PPI and PPI++ reduce to ordinary least squares and can inflate variance when the predictor is close to the oracle. We propose the Debiased External-model-Assisted Lasso (DEAL), which routes the external estimator and the unlabeled covariates into the variance of a debiased estimator, with a bias-aware, cross-fitted shrinkage step that adapts across target-only, near-oracle, and biased-but-informative regimes. We prove coordinate-wise asymptotic normality with an adaptive variance, extend validity to the projection parameter under misspecification and nonlinear labelers, and show that, at a common unlabeled budget, DEAL intervals are shorter than those of debiased Lasso, PPI, and PPI++; a shift-aware variant preserves coverage under covariate shift. In simulations, DEAL intervals are 0.49-0.87 of the debiased-Lasso length, and across six real-data applications spanning astronomy, chemistry, proteomics, and oncology, the last using a large-language-model oracle, they tighten in every case, with median length ratios of 0.23-0.53.
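The cancellation claim can be checked numerically in a small, low-dimensional setting. The sketch below is not the paper's code. It assumes the standard PPI point estimator for linear regression (a least-squares fit to the predictions on unlabeled data, minus a least-squares fit to the prediction errors on labeled data) and a purely linear external predictor f(x) = xᵀβ_ext. The names `ols`, `ppi_linear`, and `beta_ext`, and all simulation settings, are illustrative assumptions. Under these assumptions the first term equals β_ext and the rectifier equals β_ext minus the labeled OLS fit, so the external coefficients drop out.

```python
import numpy as np

def ols(X, y):
    """Least-squares coefficients for y ~ X (no intercept)."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

def ppi_linear(X_lab, y_lab, X_unlab, predict):
    """PPI point estimate for linear regression (assumed standard form).

    theta_f   : regression of predictions on the unlabeled covariates
    rectifier : regression of prediction errors on the labeled covariates
    """
    theta_f = ols(X_unlab, predict(X_unlab))
    rectifier = ols(X_lab, predict(X_lab) - y_lab)
    return theta_f - rectifier

# Illustrative simulation settings (assumptions, not taken from the paper).
rng = np.random.default_rng(0)
n, N, p = 200, 5000, 5
beta = rng.normal(size=p)
X_lab = rng.normal(size=(n, p))
y_lab = X_lab @ beta + rng.normal(size=n)
X_unlab = rng.normal(size=(N, p))

# A linear external predictor whose coefficients are close to, but not equal to, beta.
beta_ext = beta + 0.3 * rng.normal(size=p)
predict = lambda X: X @ beta_ext

theta_ppi = ppi_linear(X_lab, y_lab, X_unlab, predict)
theta_ols = ols(X_lab, y_lab)

# Largest coordinate-wise gap between PPI and labeled-only OLS.
max_gap = np.max(np.abs(theta_ppi - theta_ols))
print(max_gap)
```

The printed gap is at the level of floating-point error regardless of how `beta_ext` is chosen, which illustrates why a linear external model adds no information to the PPI point estimate in this setting. The high-dimensional case treated in the abstract (where OLS is not available) is outside the scope of this sketch.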