A fundamental problem in supervised learning is to find a good set of features or distance measures. If the new features are of lower dimensionality and can be obtained by a simple transformation of the original data, they can make the model understandable, reduce overfitting, and even help to detect distribution drift. We propose Gradient Boosting Mapping (GBMAP), a supervised dimensionality reduction method in which the outputs of weak learners -- defined as one-layer perceptrons -- form the embedding. We show that the embedding coordinates provide better features for the supervised learning task, making simple linear models competitive with state-of-the-art regressors and classifiers. We also use the embedding to derive a principled distance measure between points. The features and distance measures automatically ignore directions irrelevant to the supervised learning task. We further show that GBMAP can reliably detect out-of-distribution data points with potentially large regression or classification errors. GBMAP is fast, running in seconds on datasets with a million data points or hundreds of features. As a bonus, GBMAP provides regression and classification performance comparable to state-of-the-art supervised learning methods.
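The core idea can be illustrated with a minimal sketch: gradient boosting where each weak learner is a one-layer perceptron fitted to the current residual, and the embedding of a point is the vector of weak-learner outputs. This is an illustrative approximation only; the function names, the `tanh` activation, the plain gradient-descent training loop, and all hyperparameters below are assumptions, not the paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_weak_learner(X, r, steps=200, lr=0.1):
    """Fit a one-layer perceptron a * tanh(X @ w + b) to residuals r.
    Hypothetical training loop (plain gradient descent on squared error)."""
    n, d = X.shape
    w = rng.normal(scale=0.1, size=d)
    b, a = 0.0, 1.0
    for _ in range(steps):
        z = np.tanh(X @ w + b)
        err = a * z - r                    # pointwise error of this learner
        ga = (err * z).mean()              # gradient w.r.t. output scale a
        gz = err * a * (1.0 - z**2)        # backprop through tanh
        gw = X.T @ gz / n
        gb = gz.mean()
        a -= lr * ga
        w -= lr * gw
        b -= lr * gb
    return w, b, a

def gbmap_fit(X, y, n_learners=5, nu=0.5):
    """Greedy boosting: each stage fits a perceptron to the current residual."""
    learners, pred = [], np.zeros(len(y))
    for _ in range(n_learners):
        w, b, a = fit_weak_learner(X, y - pred)
        learners.append((w, b, a))
        pred += nu * a * np.tanh(X @ w + b)
    return learners

def gbmap_embed(X, learners):
    """Embedding: one coordinate per weak learner (its output on each point)."""
    return np.column_stack([a * np.tanh(X @ w + b) for w, b, a in learners])

# Toy usage: embed a small nonlinear regression dataset.
X = rng.normal(size=(200, 3))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1]
learners = gbmap_fit(X, y)
Z = gbmap_embed(X, learners)
print(Z.shape)  # one embedding coordinate per weak learner
```

Because each coordinate is a weak learner trained on the supervised target, directions of the input space that do not help predict `y` contribute little to the embedding, which is what makes distances in this space ignore task-irrelevant variation.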