Unlabeled data is a key component of modern machine learning. In general, the role of unlabeled data is to impose a form of smoothness, usually derived from the similarity information encoded in a base kernel, such as the $\epsilon$-neighbor kernel or the adjacency matrix of a graph. This work revisits the classical idea of spectrally transformed kernel regression (STKR), and provides a new class of general and scalable STKR estimators that can leverage unlabeled data. Intuitively, via spectral transformation, STKR exploits the data distribution, about which unlabeled data can provide additional information. First, we show that STKR is a principled and general approach, by characterizing a universal type of "target smoothness", and proving that any sufficiently smooth function can be learned by STKR. Second, we provide scalable STKR implementations for the inductive setting and a general transformation function, whereas prior work is mostly limited to the transductive setting. Third, we derive statistical guarantees for two scenarios: STKR with a known polynomial transformation, and STKR with kernel PCA when the transformation is unknown. Overall, we believe that this work deepens our understanding of how to work with unlabeled data, and that its generality makes it a useful starting point for new methods.
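To make the idea of a spectral transformation concrete, the following is a minimal sketch of the classical transductive form of STKR with a known polynomial transformation. It is not the scalable inductive estimator this work proposes. The RBF base kernel, the $1/n$ scaling of the Gram matrix, the ridge parameter, and the names `rbf_kernel` and `stkr_transductive` are all assumptions made for illustration.

```python
import numpy as np

def rbf_kernel(X, Y, gamma=1.0):
    # Base kernel encoding similarity. RBF is an illustrative assumption;
    # an epsilon-neighbor kernel or a graph adjacency matrix would also fit.
    sq_dists = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq_dists)

def stkr_transductive(X_all, y_labeled, n_labeled, coeffs, ridge=1e-3, gamma=1.0):
    """Hypothetical helper. The first n_labeled rows of X_all are labeled,
    the remaining rows are unlabeled. Returns predictions for all rows."""
    n = X_all.shape[0]
    # Gram matrix over labeled AND unlabeled points, scaled by 1/n.
    K = rbf_kernel(X_all, X_all, gamma) / n
    lam, U = np.linalg.eigh(K)
    lam = np.clip(lam, 0.0, None)  # remove tiny negative round-off
    # Polynomial spectral transformation s(lam) = sum_p coeffs[p-1] * lam**p.
    s_lam = sum(c * lam ** (p + 1) for p, c in enumerate(coeffs))
    # Transformed kernel: same eigenvectors, transformed eigenvalues.
    K_s = (U * s_lam) @ U.T
    # Kernel ridge regression on the labeled block of the transformed kernel.
    K_ll = K_s[:n_labeled, :n_labeled]
    alpha = np.linalg.solve(K_ll + ridge * np.eye(n_labeled), y_labeled)
    return K_s[:, :n_labeled] @ alpha

# Usage on synthetic data (values chosen arbitrarily for illustration).
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 2))
y = np.sin(X[:10, 0])  # only the first 10 points are labeled
preds = stkr_transductive(X, y, n_labeled=10, coeffs=[0.5, 0.3, 0.2])
```

Because the eigendecomposition is taken over all points, the unlabeled rows shape the transformed kernel even though only the labeled rows enter the regression. With `coeffs=[1.0]` the transformation is the identity and the sketch reduces to ordinary kernel ridge regression on the scaled base kernel.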