Selecting a handful of truly relevant variables in supervised machine learning is challenging because of the untestable assumptions that must hold and the lack of theoretical guarantees that selection errors are under control. We propose a distribution-free feature selection method, Data Splitting Selection (DSS), which controls the false discovery rate (FDR) of feature selection while achieving high power. We also propose a second version of DSS that has higher power and "almost" controls the FDR. No assumptions are made on the distribution of the response or on the joint distribution of the features. Extensive simulations compare the performance of the proposed methods with that of existing ones.