In this paper, we propose a class of non-parametric classifiers that learn arbitrary boundaries and generalize well. Our approach is based on a novel way to regularize 1NN classifiers using a greedy approach. We refer to this class of classifiers as watershed classifiers. 1NN classifiers are known to trivially over-fit: they have a very large VC dimension and hence do not generalize well. We show that watershed classifiers can find arbitrary boundaries on any sufficiently dense dataset while having a very small VC dimension; hence a watershed classifier leads to good generalization. The traditional approach to regularizing 1NN classifiers is to consider the $K$ nearest neighbours instead. Neighbourhood component analysis (NCA) proposes a way to learn representations consistent with the ($n-1$) nearest neighbour classifier, where $n$ denotes the size of the dataset. In this article, we propose a loss function that learns representations consistent with watershed classifiers, and show that it outperforms the NCA baseline.
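To make the greedy regularization of 1NN concrete, here is a minimal illustrative sketch (not the paper's exact algorithm): starting from labeled seed points, the unlabeled point closest to any labeled point is greedily assigned the label of its nearest labeled neighbour, so labels "flood" outward in order of distance, as in a watershed transform. The function name `watershed_classify` and the brute-force distance computation are assumptions made for this sketch.

```python
import numpy as np

def watershed_classify(X_labeled, y_labeled, X_unlabeled):
    """Greedy 1NN label propagation (illustrative sketch of the watershed
    idea, not the paper's exact algorithm).

    Repeatedly find the unlabeled point closest to any already-labeled
    point and give it that neighbour's label, until all points are labeled.
    """
    X = np.vstack([X_labeled, X_unlabeled])
    n_seed = len(y_labeled)
    labels = np.full(len(X), -1)
    labels[:n_seed] = y_labeled

    while (labels == -1).any():
        lab_idx = np.where(labels != -1)[0]
        unlab_idx = np.where(labels == -1)[0]
        # Pairwise distances from every unlabeled point to every labeled point.
        d = np.linalg.norm(X[unlab_idx][:, None] - X[lab_idx][None], axis=-1)
        # Greedy step: label the globally closest unlabeled point.
        u, l = np.unravel_index(np.argmin(d), d.shape)
        labels[unlab_idx[u]] = labels[lab_idx[l]]

    return labels[n_seed:]
```

Note that, unlike plain 1NN, each newly labeled point can itself propagate its label, so the decision regions grow from the seeds rather than being fixed by the training set alone.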