In semi-supervised learning, the prevailing understanding suggests that observing additional unlabeled samples improves estimation accuracy for linear parameters only in the case of model misspecification. This paper challenges this notion, demonstrating its inaccuracy in high dimensions. Initially focusing on a dense scenario, we introduce robust semi-supervised estimators for the regression coefficient without relying on sparse structures in the population slope. Even when the true underlying model is linear, we show that leveraging information from large-scale unlabeled data improves both estimation accuracy and inference robustness. Moreover, we propose semi-supervised methods with further enhanced efficiency in scenarios with a sparse linear slope. Diverging from the standard semi-supervised literature, we also allow for covariate shift. The performance of the proposed methods is illustrated through extensive numerical studies, including simulations and a real-data application to the AIDS Clinical Trials Group Protocol 175 (ACTG175).
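To make the central claim concrete, the following is a minimal simulation sketch and not the estimators proposed in the paper. It assumes mean-zero Gaussian covariates with identity covariance, no intercept, no covariate shift, and a correctly specified linear model with a dense slope. The semi-supervised estimate used here is a simple moment-based stand-in: the second-moment matrix is estimated from labeled and unlabeled covariates together, while the cross-moment with the response uses only labeled data. The function name `simulate` and all parameter values are illustrative choices.

```python
import numpy as np

def simulate(n=100, p=80, N=5000, reps=100, sigma=1.0, seed=0):
    """Compare OLS with a moment-based semi-supervised estimate of beta[0].

    n: labeled sample size, p: dimension, N: unlabeled sample size.
    Returns the Monte Carlo mean squared error of each estimate of beta[0].
    """
    rng = np.random.default_rng(seed)
    beta = np.ones(p) / np.sqrt(p)  # dense slope: every coordinate is nonzero
    err_ols, err_ss = [], []
    for _ in range(reps):
        X = rng.standard_normal((n, p))                 # labeled covariates
        y = X @ beta + sigma * rng.standard_normal(n)   # correctly specified linear model
        X_unl = rng.standard_normal((N, p))             # unlabeled covariates, same distribution

        # Supervised baseline: ordinary least squares on the labeled data only.
        b_ols = np.linalg.lstsq(X, y, rcond=None)[0]

        # Semi-supervised stand-in (an assumption of this sketch):
        # second moments from all covariates, cross-moment from labeled data.
        X_all = np.vstack([X, X_unl])
        Sigma_hat = X_all.T @ X_all / (n + N)
        b_ss = np.linalg.solve(Sigma_hat, X.T @ y / n)

        err_ols.append((b_ols[0] - beta[0]) ** 2)
        err_ss.append((b_ss[0] - beta[0]) ** 2)
    return float(np.mean(err_ols)), float(np.mean(err_ss))

mse_ols, mse_ss = simulate()
print(f"MSE of first coefficient: OLS={mse_ols:.4f}, semi-supervised={mse_ss:.4f}")
```

With p/n = 0.8 the labeled sample barely exceeds the dimension, so OLS is noisy and the run is expected to show a smaller error for the semi-supervised estimate. When p is small relative to n, the ordering can reverse for this simple stand-in, which is consistent with the conventional low-dimensional view that the abstract contrasts against.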