Traditional statistical feature selection methods often struggle when applied to high-dimensional, low-sample-size data, encountering problems such as overfitting, the curse of dimensionality, computational infeasibility, and strong model assumptions. In this paper, we propose a novel two-step nonparametric approach called Deep Feature Screening (DeepFS) that overcomes these problems and identifies significant features with high precision for ultra-high-dimensional, low-sample-size data. The approach first extracts a low-dimensional representation of the input data and then applies feature screening based on the multivariate rank distance correlation recently developed by Deb and Sen (2021). By combining the strengths of deep neural networks and feature screening, DeepFS has the following appealing properties in addition to its ability to handle ultra-high-dimensional data with a small number of samples: (1) it is model-free and distribution-free; (2) it can be used for both supervised and unsupervised feature selection; and (3) it is capable of recovering the original input data. The superiority of DeepFS is demonstrated via extensive simulation studies and real data analyses.
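The two-step structure described above (reduce the data to a low-dimensional representation, then rank each original feature by its dependence on that representation) can be illustrated with a minimal sketch. It is not the DeepFS implementation and rests on two loudly stated assumptions: principal component scores stand in for the deep neural network encoder, and the ordinary sample distance correlation stands in for the multivariate rank distance correlation of Deb and Sen (2021). All function names (`pca_embed`, `centered_distances`, `distance_correlation`, `screen_features`, `make_toy_data`) and the toy data are hypothetical and chosen only for illustration.

```python
import numpy as np


def pca_embed(X, k):
    """Step 1 stand-in (assumption): top-k principal component scores of X.

    DeepFS uses a deep neural network here; PCA is only a placeholder.
    """
    Xc = X - X.mean(axis=0)
    U, S, _ = np.linalg.svd(Xc, full_matrices=False)
    return U[:, :k] * S[:k]


def centered_distances(A):
    """Double-centered matrix of pairwise Euclidean distances between rows."""
    A = np.asarray(A, dtype=float).reshape(len(A), -1)
    D = np.linalg.norm(A[:, None, :] - A[None, :, :], axis=-1)
    return (D - D.mean(axis=0, keepdims=True)
              - D.mean(axis=1, keepdims=True) + D.mean())


def distance_correlation(a, b):
    """Sample distance correlation (V-statistic form) between a and b.

    Step 2 stand-in (assumption): the paper screens with the multivariate
    rank distance correlation instead of this plain version.
    """
    A, B = centered_distances(a), centered_distances(b)
    dcov2 = max((A * B).mean(), 0.0)
    denom = np.sqrt((A * A).mean() * (B * B).mean())
    return float(np.sqrt(dcov2 / denom)) if denom > 0 else 0.0


def screen_features(X, k=2, n_keep=5):
    """Score every column of X against the k-dimensional representation."""
    Z = pca_embed(X, k)
    scores = np.array(
        [distance_correlation(X[:, j], Z) for j in range(X.shape[1])]
    )
    keep = np.argsort(scores)[::-1][:n_keep]  # indices of the largest scores
    return keep, scores


def make_toy_data(n=60, p=200, seed=0):
    """Hypothetical data: 5 features driven by 2 latent factors, rest noise."""
    rng = np.random.default_rng(seed)
    z = rng.standard_normal((n, 2))
    W = np.array([[1, 1, 1, 1, 1], [1, -1, 1, -1, 1]], dtype=float)
    X = rng.standard_normal((n, p))
    X[:, :5] = 5.0 * z @ W + 0.1 * rng.standard_normal((n, 5))
    return X


if __name__ == "__main__":
    keep, scores = screen_features(make_toy_data(), k=2, n_keep=5)
    print("selected feature indices:", sorted(keep.tolist()))
```

The sketch works in the unsupervised setting, where the representation is learned from the features alone. The placeholders were chosen so that the example needs only NumPy; replacing `pca_embed` with a trained encoder and `distance_correlation` with a rank-based dependence measure would bring it closer to the approach summarized above.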