Traditional statistical feature selection methods often struggle when applied to high-dimensional, low-sample-size data, encountering problems such as overfitting, the curse of dimensionality, computational infeasibility, and strong model assumptions. In this paper, we propose a novel two-step nonparametric approach called Deep Feature Screening (DeepFS) that overcomes these problems and identifies significant features with high precision for ultra-high-dimensional, low-sample-size data. The approach first extracts a low-dimensional representation of the input data and then applies feature screening based on the multivariate rank distance correlation recently developed by Deb and Sen (2021). By combining the strengths of deep neural networks and feature screening, DeepFS has the following appealing properties in addition to its ability to handle ultra-high-dimensional data with a small number of samples: (1) it is model-free and distribution-free; (2) it can be used for both supervised and unsupervised feature selection; and (3) it is capable of recovering the original input data. The superiority of DeepFS is demonstrated via extensive simulation studies and real data analyses.
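The two-step structure described above can be illustrated with a minimal sketch. It is not the authors' implementation and rests on several assumptions: step 1 uses a PCA projection as a stand-in for the deep neural network encoder, step 2 uses the classical sample distance correlation as a stand-in for the multivariate rank distance correlation of Deb and Sen (2021), and each feature is scored by its dependence on the low-dimensional representation (one plausible reading of the unsupervised case). The function names, the toy data, and the parameters `k` and `top` are all hypothetical.

```python
import numpy as np

def low_dim_representation(X, k):
    """Step 1 stand-in (assumption): PCA scores instead of a neural encoder."""
    Xc = X - X.mean(axis=0)
    U, S, _ = np.linalg.svd(Xc, full_matrices=False)
    return U[:, :k] * S[:k]  # shape (n, k)

def _centered_dist(A):
    """Double-centered matrix of pairwise Euclidean distances."""
    A = A.reshape(len(A), -1)
    D = np.linalg.norm(A[:, None, :] - A[None, :, :], axis=-1)
    return D - D.mean(axis=0, keepdims=True) - D.mean(axis=1, keepdims=True) + D.mean()

def distance_correlation(x, Z):
    """Step 2 stand-in (assumption): classical sample distance correlation,
    not the rank-based version used in the paper."""
    A, B = _centered_dist(x), _centered_dist(Z)
    dcov2 = (A * B).mean()
    denom = np.sqrt((A * A).mean() * (B * B).mean())
    return 0.0 if denom == 0 else float(np.sqrt(max(dcov2, 0.0) / denom))

def screen_features(X, k=2, top=5):
    """Score every feature against the representation and keep the top ones."""
    Z = low_dim_representation(X, k)
    scores = np.array([distance_correlation(X[:, j], Z) for j in range(X.shape[1])])
    selected = np.argsort(scores)[::-1][:top]
    return selected, scores

# Toy high-dimensional, low-sample-size data: n = 40 samples, p = 200 features.
# Only the first 5 features depend on two hidden factors; the rest are noise.
rng = np.random.default_rng(0)
n, p = 40, 200
latent = rng.normal(size=(n, 2))
W = np.array([[3.0, 3.0, 3.0, 0.0, 0.0],
              [0.0, 0.0, 3.0, 3.0, 3.0]])
X = rng.normal(size=(n, p))
X[:, :5] += latent @ W

selected, scores = screen_features(X, k=2, top=5)
print("selected feature indices:", sorted(selected.tolist()))
print("mean score, signal vs. noise:", scores[:5].mean(), scores[5:].mean())
```

Swapping the PCA step for an autoencoder and the dependence measure for a rank-based one would bring the sketch closer to the method the abstract describes; the screening loop itself would stay the same.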