Feature selection is the problem of selecting a subset of features for a machine learning model that maximizes model quality subject to a budget constraint. For neural networks, prior methods, including those based on $\ell_1$ regularization, attention, and other techniques, typically select the entire feature subset in one evaluation round, ignoring the residual value of features during selection, i.e., the marginal contribution of a feature given that other features have already been selected. We propose a feature selection algorithm called Sequential Attention that achieves state-of-the-art empirical results for neural networks. This algorithm is based on an efficient one-pass implementation of greedy forward selection and uses attention weights at each step as a proxy for feature importance. We give theoretical insights into our algorithm for linear regression by showing that an adaptation to this setting is equivalent to the classical Orthogonal Matching Pursuit (OMP) algorithm and thus inherits all of its provable guarantees. Our theoretical and empirical analyses offer new explanations for the effectiveness of attention and its connections to overparameterization, which may be of independent interest.
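To make the notion of residual value concrete, the following is a minimal sketch of classical Orthogonal Matching Pursuit for linear regression, the algorithm the abstract relates Sequential Attention to. It is not the authors' Sequential Attention implementation. The function name `omp_select` and its arguments are hypothetical, and the sketch assumes a dense design matrix `X` with nonzero columns.

```python
import numpy as np

def omp_select(X, y, k):
    """Classical OMP sketch: greedily select k columns of X to explain y.

    Illustrative only; omp_select is a hypothetical name, not an API
    from the paper.
    """
    col_norms = np.linalg.norm(X, axis=0)  # assumed nonzero
    selected = []
    residual = y.astype(float).copy()
    for _ in range(k):
        # Score each feature by its normalized correlation with the
        # current residual, i.e. its marginal value given earlier picks.
        scores = np.abs(X.T @ residual) / col_norms
        if selected:
            scores[selected] = -np.inf  # never pick a feature twice
        j = int(np.argmax(scores))
        selected.append(j)
        # Refit least squares on all selected features, then recompute
        # the residual so it is orthogonal to the selected columns.
        coef, *_ = np.linalg.lstsq(X[:, selected], y, rcond=None)
        residual = y - X[:, selected] @ coef
    return selected
```

Each round scores features against the residual rather than against `y` itself, so a feature that is redundant with an already selected one receives a low score. This is the marginal-contribution behavior that single-round selection methods ignore.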