In many modern machine learning pipelines, abundant pretrained representations serve as noisy proxy covariates, while task-specific labels remain scarce. We study semi-supervised regression in this setting and propose a simple two-stage estimator that first learns kernel eigenfeatures from all proxy covariates and then fits a ridge predictor on the labeled data. We derive finite-sample bounds showing that fast labeled-sample rates are recovered when the proxy perturbation is controlled and unlabeled proxy covariates are sufficiently abundant. We also show that distribution regression is a direct special case, with analogous guarantees when the finite bag size is large enough. Experiments show consistent gains over supervised and semi-supervised baselines, especially in low-label regimes.
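To make the two-stage estimator concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes an RBF kernel, a Nyström-style extension of the empirical kernel eigenvectors to new points, and a ridge penalty scaled by the number of labeled samples. The abstract does not specify any of these choices, and the names `rbf_kernel`, `fit_eigenfeatures`, and `two_stage_fit` are hypothetical.

```python
import numpy as np


def rbf_kernel(A, B, gamma=1.0):
    """Gaussian (RBF) kernel matrix between the rows of A and B (assumed kernel)."""
    sq = (A ** 2).sum(1)[:, None] + (B ** 2).sum(1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * np.maximum(sq, 0.0))


def fit_eigenfeatures(X_all, k, gamma=1.0):
    """Stage 1: learn the top-k kernel eigenfeatures from ALL proxy covariates."""
    n = X_all.shape[0]
    K = rbf_kernel(X_all, X_all, gamma)
    evals, evecs = np.linalg.eigh(K / n)       # eigenvalues in ascending order
    top = np.argsort(evals)[::-1][:k]          # indices of the k largest
    evals = np.maximum(evals[top], 1e-12)      # guard against division by ~0
    evecs = evecs[:, top]

    def features(X):
        # Nystrom extension: phi_j(x) = k(x, X_all) @ u_j / (sqrt(n) * lambda_j),
        # so that the features are empirically orthonormal on X_all.
        return rbf_kernel(X, X_all, gamma) @ evecs / (np.sqrt(n) * evals)

    return features


def two_stage_fit(X_all, X_lab, y_lab, k, lam, gamma=1.0):
    """Stage 2: ridge regression on the eigenfeatures of the labeled points."""
    features = fit_eigenfeatures(X_all, k, gamma)
    Phi = features(X_lab)
    n_lab = len(y_lab)
    # Penalty scaled by n_lab is an illustrative convention, not from the source.
    w = np.linalg.solve(Phi.T @ Phi + lam * n_lab * np.eye(k), Phi.T @ y_lab)
    return lambda X: features(X) @ w
```

Here `X_all` stacks the labeled and unlabeled proxy covariates, so the eigenfeatures benefit from the unlabeled data, while only `X_lab` and `y_lab` enter the ridge fit. The number of eigenfeatures `k` and the penalty `lam` would be tuned in practice, for example by cross-validation on the labeled set.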