A Performance-Driven Benchmark for Feature Selection in Tabular Deep Learning

Academic tabular benchmarks often contain small sets of curated features. In contrast, data scientists typically collect as many features as possible into their datasets, and even engineer new features from existing ones. To prevent overfitting in downstream modeling, practitioners commonly use automated feature selection methods that identify a reduced subset of informative features. Existing benchmarks for tabular feature selection consider classical downstream models, use toy synthetic datasets, or do not evaluate feature selectors on the basis of downstream performance. Motivated by the increasing popularity of tabular deep learning, we construct a challenging feature selection benchmark evaluated on downstream neural networks including transformers, using real datasets and multiple methods for generating extraneous features. We also propose an input-gradient-based analogue of Lasso for neural networks that outperforms classical feature selection methods on challenging problems such as selecting from corrupted or second-order features.
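
The abstract does not give the form of the input-gradient-based penalty. As a rough sketch only, one plausible reading is a group-Lasso-style penalty: the L2 norm, taken over the batch, of the loss gradient with respect to each input feature column, summed over features. The network (one hidden tanh layer, squared-error loss), the function name `input_gradient_penalty`, and all shapes below are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def input_gradient_penalty(W1, b1, w2, b2, X, y):
    """Per-feature input-gradient norms and their sum for a small tanh regressor.

    Assumed model: pred = tanh(X @ W1 + b1) @ w2 + b2,
    assumed loss per sample: 0.5 * (pred - y) ** 2.
    """
    H = np.tanh(X @ W1 + b1)               # hidden activations, shape (n, h)
    pred = H @ w2 + b2                      # predictions, shape (n,)
    d_pred = pred - y                       # dLoss/dpred for each sample
    d_H = d_pred[:, None] * w2[None, :]     # backprop through the output layer
    d_Z = d_H * (1.0 - H ** 2)              # backprop through tanh
    d_X = d_Z @ W1.T                        # dLoss/dX, shape (n, d)
    norms = np.sqrt((d_X ** 2).sum(axis=0)) # one L2 norm per feature column
    return norms, norms.sum()

# Toy setup (hypothetical data): 8 samples, 3 features, 4 hidden units.
rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))
y = rng.normal(size=8)
W1 = rng.normal(size=(3, 4))
W1[2, :] = 0.0                              # the network ignores feature 2
b1 = np.zeros(4)
w2 = rng.normal(size=4)
b2 = 0.0

norms, penalty = input_gradient_penalty(W1, b1, w2, b2, X, y)
print("per-feature gradient norms:", norms)
print("penalty:", penalty)
```

A feature the network ignores (here feature 2, whose first-layer weights are zeroed) gets a gradient norm of exactly zero. Adding such a penalty to the training loss would push the network toward relying on fewer inputs, and the same norms could serve as importance scores for ranking features.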

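Feature selection

Feature selection, also called feature subset selection (FSS) or attribute selection, is the process of choosing N features from an existing set of M features so that a specified metric of the system is optimized. It picks the most effective of the original features in order to reduce the dimensionality of the dataset. It is an important means of improving the performance of a learning algorithm and a key data preprocessing step in pattern recognition. For a learning algorithm, good training samples are key to training the model.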
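
To make "choose N of M features" concrete, here is a minimal sketch of a classical univariate filter: extraneous noise columns are appended to a synthetic dataset, each column is scored by its absolute Pearson correlation with the target, and the top N are kept. The data, the scoring rule, and the helper name `select_top_n` are assumptions for illustration; this is not the benchmark's procedure.

```python
import numpy as np

def select_top_n(X, y, n):
    """Return indices of the n columns of X most correlated with y, plus all scores."""
    Xc = X - X.mean(axis=0)                 # center each feature column
    yc = y - y.mean()                       # center the target
    denom = np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum())
    # Guard against constant columns, which would give a zero denominator.
    scores = np.abs(Xc.T @ yc) / np.where(denom > 0, denom, 1.0)
    order = np.argsort(scores)[::-1]        # highest score first
    return order[:n], scores

# Synthetic data (hypothetical): 2 informative columns plus 4 pure-noise columns.
rng = np.random.default_rng(0)
informative = rng.normal(size=(500, 2))
y = 3.0 * informative[:, 0] - 2.0 * informative[:, 1] + 0.1 * rng.normal(size=500)
noise = rng.normal(size=(500, 4))           # extraneous features
X = np.hstack([informative, noise])         # M = 6 features in total

selected, scores = select_top_n(X, y, n=2)  # keep N = 2
print("selected columns:", sorted(selected.tolist()))
```

A univariate score like this cannot see interactions between features, which is one reason to judge a selector by the performance of the downstream model trained on the selected subset.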
