Binary Feature Mask Optimization for Feature Selection

We investigate the feature selection problem for generic machine learning (ML) models. We introduce a novel framework that selects features based on the predictions of the model. Instead of removing features from the dataset, the framework eliminates them during the selection process by applying a feature mask. This lets us reuse the same trained ML model throughout feature selection, unlike other feature selection methods, which must retrain the model because the dataset has different dimensions at each iteration. We obtain the mask operator from the predictions of the ML model, which offers a comprehensive view of the feature subsets that are essential for the model's predictive performance. A variety of approaches exist in the feature selection literature. However, no study has introduced a training-free framework for a generic ML model that selects features by considering the importance of feature subsets as a whole rather than focusing on individual features. We demonstrate significant performance improvements on real-life datasets under different settings, using LightGBM and a Multi-Layer Perceptron as our ML models. Additionally, we openly share the implementation code for our methods to encourage further research and contributions in this area.
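The masking idea can be illustrated with a small sketch. The abstract does not state how masked features are filled in or how the mask is optimized, so everything below is an assumption made for illustration: masked columns are overwritten with their training means, the mask is found by a greedy backward search on validation error, and a least-squares linear model stands in for LightGBM or a Multi-Layer Perceptron. The names `apply_mask` and `greedy_mask_search` are hypothetical and are not taken from the authors' released code.

```python
import numpy as np


def apply_mask(X, mask, fill):
    """Keep columns where mask == 1; overwrite the others with `fill`.

    The array keeps its shape, so an already trained model can still
    be called on it without retraining.
    """
    return np.where(mask.astype(bool), X, fill)


def greedy_mask_search(predict, X_val, y_val, fill, tol=0.0):
    """Assumed search strategy: greedy backward elimination.

    At each step, mask the single feature whose masking gives the lowest
    validation MSE, and stop once that MSE exceeds the best seen so far
    by more than `tol`. The model is never retrained.
    """
    mask = np.ones(X_val.shape[1], dtype=int)

    def loss(m):
        pred = predict(apply_mask(X_val, m, fill))
        return float(np.mean((pred - y_val) ** 2))

    best = loss(mask)
    while mask.sum() > 1:
        candidates = []
        for j in np.flatnonzero(mask):
            trial = mask.copy()
            trial[j] = 0
            candidates.append((loss(trial), int(j)))
        trial_loss, j = min(candidates)
        if trial_loss > best + tol:
            break
        mask[j] = 0
        best = min(best, trial_loss)
    return mask, loss(mask)


# Synthetic data: only the first two of six features carry signal.
rng = np.random.default_rng(0)
X = rng.normal(size=(800, 6))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.1 * rng.normal(size=800)
X_train, y_train = X[:500], y[:500]
X_val, y_val = X[500:], y[500:]

# Stand-in model: ordinary least squares with a bias term, trained once.
A = np.hstack([X_train, np.ones((len(X_train), 1))])
w, *_ = np.linalg.lstsq(A, y_train, rcond=None)


def predict(X_in):
    return X_in @ w[:-1] + w[-1]


fill = X_train.mean(axis=0)  # assumed fill value for masked columns
mask, val_mse = greedy_mask_search(predict, X_val, y_val, fill, tol=0.05)
print("selected mask:", mask.tolist(), "validation MSE:", round(val_mse, 4))
```

Because the masked input keeps its original dimensions, `predict` is called many times but the model is fit only once. The greedy search is only one possible choice; it scores each candidate mask by the model's predictions with that whole subset active, rather than ranking features individually.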

Feature Selection

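Feature selection, also called feature subset selection (FSS) or attribute selection, is the process of choosing N features from an existing set of M features so that a given performance metric of the system is optimized. It reduces the dimensionality of the dataset by keeping only the most effective of the original features. It is an important means of improving the performance of learning algorithms and a key data preprocessing step in pattern recognition. For a learning algorithm, good training samples are key to training the model.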
