This paper considers binary classification of high-dimensional features under a postulated model with a low-dimensional latent Gaussian mixture structure and non-vanishing noise. A generalized least squares estimator is used to estimate the direction of the optimal separating hyperplane, and the estimated hyperplane is shown to interpolate the training data. While the direction vector can be consistently estimated, as could be expected from recent results in linear regression, a naive plug-in estimate fails to consistently estimate the intercept. A simple correction that requires an independent hold-out sample renders the procedure minimax optimal in many scenarios. The interpolation property of the corrected procedure can be retained but, surprisingly, depends on the way the labels are encoded.
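The interpolation phenomenon described in the abstract can be illustrated with a minimal simulation. The sketch below is not the paper's procedure: it assumes the direction estimate takes the minimum-norm least squares form (computed with a pseudoinverse), uses labels encoded as -1/+1, and chooses the intercept as the midpoint of the class-wise mean scores on a hold-out sample. The dimensions, the loading matrix `A`, the latent mean `mu`, and the helper `sample` are all hypothetical choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative values only: p > n puts us in the high-dimensional regime.
n, p, K, sigma = 50, 500, 2, 1.0
mu = np.array([2.0, 0.0])      # hypothetical latent class-mean offset
A = rng.normal(size=(p, K))    # hypothetical loading matrix (latent -> features)


def sample(m):
    """Draw m points: labels in {-1, +1}, latent Gaussian mixture, noisy features."""
    y = rng.choice([-1.0, 1.0], size=m)
    Z = y[:, None] * mu + rng.normal(size=(m, K))    # latent low-dimensional mixture
    X = Z @ A.T + sigma * rng.normal(size=(m, p))    # observed features with noise
    return X, y


X_train, y_train = sample(n)
X_hold, y_hold = sample(200)    # independent hold-out sample for the intercept
X_test, y_test = sample(1000)

# Minimum-norm least squares direction (assumed form of the estimator).
theta = np.linalg.pinv(X_train) @ y_train

# With p > n and full row rank, the fitted values reproduce the labels exactly.
interpolates = np.allclose(X_train @ theta, y_train)

# Assumed intercept rule: midpoint between class-wise mean scores on the hold-out data.
scores = X_hold @ theta
b = -0.5 * (scores[y_hold == 1].mean() + scores[y_hold == -1].mean())

test_acc = np.mean(np.sign(X_test @ theta + b) == y_test)
print("interpolates training data:", interpolates)
print("test accuracy:", test_acc)
```

In this sketch the fitted training scores equal the -1/+1 labels exactly, so the thresholded rule keeps zero training error whenever the estimated intercept satisfies |b| < 1. With a different label encoding the fitted scores change, which gives some intuition for why the encoding can matter.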