Sequential Attention for Feature Selection

Feature selection is the problem of selecting a subset of features for a machine learning model that maximizes model quality subject to a budget constraint. For neural networks, prior methods, including those based on $\ell_1$ regularization, attention, and other techniques, typically select the entire feature subset in one evaluation round, ignoring the residual value of features during selection, i.e., the marginal contribution of a feature given that other features have already been selected. We propose a feature selection algorithm called Sequential Attention that achieves state-of-the-art empirical results for neural networks. This algorithm is based on an efficient one-pass implementation of greedy forward selection and uses attention weights at each step as a proxy for feature importance. We give theoretical insights into our algorithm for linear regression by showing that an adaptation to this setting is equivalent to the classical Orthogonal Matching Pursuit (OMP) algorithm, and thus inherits all of its provable guarantees. Our theoretical and empirical analyses offer new explanations for the effectiveness of attention and its connections to overparameterization, which may be of independent interest.
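
For readers unfamiliar with the OMP algorithm mentioned above, the following is a minimal NumPy sketch of classical Orthogonal Matching Pursuit used as a greedy forward feature selector for linear regression. It is not the Sequential Attention algorithm itself and is not taken from the paper; the function name `omp_select` and the toy data are illustrative assumptions. At each step it scores every unselected feature by its correlation with the current residual, which is one way to read the "marginal contribution given already-selected features" idea in the abstract.

```python
import numpy as np


def omp_select(X, y, k):
    """Return k column indices of X chosen greedily by classical OMP.

    Illustrative sketch only: the name and signature are hypothetical.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)

    # Scale columns to unit norm so residual correlations are comparable.
    norms = np.linalg.norm(X, axis=0)
    norms[norms == 0] = 1.0
    Xn = X / norms

    selected = []
    residual = y.copy()
    for _ in range(k):
        # Score each feature by |correlation| with the current residual.
        scores = np.abs(Xn.T @ residual)
        if selected:
            scores[selected] = -np.inf  # never pick a feature twice
        j = int(np.argmax(scores))
        selected.append(j)

        # Refit least squares on the selected features, then update the
        # residual so the next step measures marginal contribution.
        coef, *_ = np.linalg.lstsq(X[:, selected], y, rcond=None)
        residual = y - X[:, selected] @ coef
    return selected


# Toy usage: y depends only on columns 2 and 7 of a random design matrix.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 10))
y_demo = 3.0 * X_demo[:, 2] - 2.0 * X_demo[:, 7]
print(omp_select(X_demo, y_demo, k=2))
```

The refit after each pick is what separates OMP from a one-shot ranking: once a feature is selected, anything it already explains is removed from the residual, so redundant features score low in later rounds.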


