Academic tabular benchmarks often contain small sets of curated features. In contrast, data scientists typically collect as many features as possible into their datasets, and even engineer new features from existing ones. To prevent overfitting in subsequent downstream modeling, practitioners commonly use automated feature selection methods that identify a reduced subset of informative features. Existing benchmarks for tabular feature selection consider only classical downstream models, rely on toy synthetic datasets, or do not evaluate feature selectors on the basis of downstream performance. Motivated by the increasing popularity of tabular deep learning, we construct a challenging feature selection benchmark evaluated on downstream neural networks, including transformers, using real datasets and multiple methods for generating extraneous features. We also propose an input-gradient-based analogue of Lasso for neural networks that outperforms classical feature selection methods on challenging problems such as selecting from corrupted or second-order features.
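The abstract names an input-gradient-based analogue of Lasso but does not spell out the penalty. As a rough illustration only, the sketch below assumes one plausible form: for each input feature, take the L2 norm (across the batch) of the gradient of the loss with respect to that feature, and sum these norms as a sparsity penalty. The tiny tanh network, the function names `input_gradient_scores` and `input_gradient_penalty`, and the toy data are all hypothetical and are not taken from the paper.

```python
import numpy as np

def input_gradient_scores(X, y, W1, b1, w2, b2):
    """Per-feature L2 norm of d(loss)/d(input) for a tiny tanh MLP.

    Assumed setup (not from the paper): binary classification with mean
    cross-entropy loss and a single hidden layer.
    """
    n = X.shape[0]
    H = np.tanh(X @ W1 + b1)                         # hidden activations, (n, h)
    p = 1.0 / (1.0 + np.exp(-(H @ w2 + b2)))         # predicted probabilities, (n,)
    d_logit = (p - y) / n                            # grad of mean cross-entropy w.r.t. logits
    d_preact = np.outer(d_logit, w2) * (1.0 - H**2)  # backprop through tanh, (n, h)
    d_input = d_preact @ W1.T                        # grad w.r.t. inputs, (n, d)
    return np.linalg.norm(d_input, axis=0)           # one score per feature column

def input_gradient_penalty(X, y, W1, b1, w2, b2):
    """Group-Lasso-style penalty: sum of the per-feature gradient norms."""
    return input_gradient_scores(X, y, W1, b1, w2, b2).sum()

# Toy data: 4 features, label depends only on feature 0.
rng = np.random.default_rng(0)
X = rng.normal(size=(32, 4))
y = (X[:, 0] > 0).astype(float)

# Randomly initialised network; the first-layer weights of feature 3 are
# zeroed so the network cannot depend on that feature.
W1 = rng.normal(size=(4, 8))
b1 = np.zeros(8)
w2 = rng.normal(size=8)
b2 = 0.0
W1[3, :] = 0.0

scores = input_gradient_scores(X, y, W1, b1, w2, b2)
penalty = input_gradient_penalty(X, y, W1, b1, w2, b2)
ranking = np.argsort(-scores)  # features ordered from largest to smallest score
```

In a training loop, a penalty of this kind would be added to the task loss with a regularisation weight, and features would be ranked by their scores once training ends. A feature the network ignores receives a score of zero, which is what makes the penalty behave like a Lasso on the inputs.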