We study lossless feature selection in binary classification and nonparametric regression, for a $d$-dimensional feature vector $X=(X^{(1)},\dots ,X^{(d)})$ and label $Y$. For an index set $S\subset \{1,\dots ,d\}$, consider the selected $|S|$-dimensional feature subvector $X_S=(X^{(i)}, i\in S)$. If $L^*$ and $L^*(S)$ denote the minimum risks based on $X$ and $X_S$, respectively, then $X_S$ is called lossless if $L^*=L^*(S)$. In classification, the minimum risk is the Bayes error probability; in regression, it is the residual variance. We introduce nearest-neighbor-based test statistics for testing the hypothesis that $X_S$ is lossless. With the threshold $a_n=\log n/\sqrt{n}$, the corresponding tests are proved to be consistent under conditions on the distribution of $(X,Y)$ that are significantly milder than those in previous work. Moreover, our threshold is dimension-independent, in contrast to earlier methods, whose thresholds become too large to be useful in practice when $d$ is large.
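For concreteness, the two minimum risks can be written out as follows. This is only a sketch under stated assumptions: the label coding $Y\in\{0,1\}$ for classification, the condition $E[Y^2]<\infty$ for regression, and the symbols $\eta$ and $m$ are notation introduced here for illustration, not taken from the text above.

$$
\text{classification:}\quad \eta(x)=P(Y=1\mid X=x),\qquad L^*=E\big[\min\{\eta(X),\,1-\eta(X)\}\big],
$$

$$
\text{regression:}\quad m(x)=E[Y\mid X=x],\qquad L^*=E\big[(Y-m(X))^2\big].
$$

In both cases $L^*(S)$ is obtained by conditioning on $X_S$ instead of $X$. Since $X_S$ is a function of $X$, one always has $L^*\le L^*(S)$, so the lossless hypothesis amounts to

$$
H_0:\; L^*(S)-L^*=0 \qquad\text{against}\qquad L^*(S)-L^*>0 .
$$

The threshold satisfies $a_n=\log n/\sqrt{n}\to 0$ while $\sqrt{n}\,a_n=\log n\to\infty$, and it does not involve $d$.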