We demonstrate how supervised learning can be decomposed into a two-stage procedure in which (1) all model parameters are selected in an unsupervised manner and (2) the outputs y are added to the model without changing the parameter values. This is achieved by a new model selection criterion that, in contrast to cross-validation, can also be used without access to y. For linear ridge regression, we bound the asymptotic out-of-sample risk of our method in terms of the optimal asymptotic risk. We also demonstrate that versions of linear and kernel ridge regression, smoothing splines, k-nearest neighbors, random forests, and neural networks, trained without access to y, perform similarly to their standard y-based counterparts. Hence, our results suggest that the difference between supervised and unsupervised learning is less fundamental than it may appear.
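To make the two-stage idea concrete, the following minimal sketch uses a one-dimensional k-nearest-neighbors smoother, whose predictions are a fixed weighted average of the training outputs. It is an illustration under stated assumptions, not the method of the paper: the function names `knn_weights` and `apply_weights` are hypothetical, and k is simply fixed by hand, whereas the abstract refers to a model selection criterion (not reproduced here) for choosing such parameters without y.

```python
def knn_weights(x_train, x_query, k):
    """Stage 1: build the smoother weights from the inputs only (y is never used)."""
    weights = []
    for q in x_query:
        # Indices of the k training inputs closest to the query point.
        order = sorted(range(len(x_train)), key=lambda i: abs(x_train[i] - q))
        nearest = set(order[:k])
        # Each of the k neighbors gets weight 1/k; all other points get 0.
        weights.append([1.0 / k if i in nearest else 0.0
                        for i in range(len(x_train))])
    return weights


def apply_weights(weights, y_train):
    """Stage 2: add the outputs; each prediction is a weighted sum of y."""
    return [sum(w * y for w, y in zip(row, y_train)) for row in weights]


# Toy inputs (assumed for illustration only).
x_train = [0.0, 1.0, 2.0, 3.0, 4.0]
x_query = [0.4, 3.6]

# Assumption: k is a hand-picked placeholder, standing in for whatever
# value a y-free selection criterion would return.
k = 2

# Stage 1 runs before any outputs are available.
W = knn_weights(x_train, x_query, k)

# Stage 2: the outputs arrive and are plugged in without changing W or k.
y_train = [0.0, 2.0, 4.0, 6.0, 8.0]
predictions = apply_weights(W, y_train)
print(predictions)
```

The weight matrix `W` is computed once from the inputs alone, so the same `W` could be reused with any output vector of matching length. This mirrors stage (2) of the abstract, where y enters without changing any parameter values.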