Feature selection is the problem of selecting a subset of features for a machine learning model that maximizes model quality subject to a budget constraint. For neural networks, prior methods, including those based on $\ell_1$ regularization, attention, and other techniques, typically select the entire feature subset in one evaluation round, ignoring the residual value of features during selection, i.e., the marginal contribution of a feature given that other features have already been selected. We propose a feature selection algorithm called Sequential Attention that achieves state-of-the-art empirical results for neural networks. This algorithm is based on an efficient one-pass implementation of greedy forward selection and uses attention weights at each step as a proxy for feature importance. We give theoretical insights into our algorithm for linear regression by showing that an adaptation to this setting is equivalent to the classical Orthogonal Matching Pursuit (OMP) algorithm, and thus inherits all of its provable guarantees. Our theoretical and empirical analyses offer new explanations for the effectiveness of attention and its connections to overparameterization, which may be of independent interest.
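To make the notion of residual value concrete, here is a minimal sketch of classical OMP for linear regression, the algorithm the abstract relates to. It is not Sequential Attention itself: no neural network or attention weights are involved. The function name `omp_select`, the pure-Python list representation, and the toy data are illustrative assumptions, and the sketch assumes nonzero feature columns and `k` no larger than the number of features.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def omp_select(columns, y, k):
    """Pick k feature indices by classical Orthogonal Matching Pursuit.

    columns: list of feature columns, each a list of n floats (assumed nonzero).
    y: target vector of n floats.
    Returns the selected indices (in selection order) and the final residual.
    """
    residual = list(y)
    basis = []     # orthonormal basis for the span of the selected columns
    selected = []
    norms = [math.sqrt(dot(c, c)) for c in columns]
    for _ in range(k):
        # Greedy step: score each unselected feature by its normalized
        # correlation with the current residual, i.e. its marginal value
        # given the features already chosen.
        best = max(
            (j for j in range(len(columns)) if j not in selected),
            key=lambda j: abs(dot(columns[j], residual)) / norms[j],
        )
        selected.append(best)
        # Gram-Schmidt: remove the part of the new column that the
        # previously selected columns already span.
        q = list(columns[best])
        for b in basis:
            coef = dot(b, q)
            q = [qi - coef * bi for qi, bi in zip(q, b)]
        q_norm = math.sqrt(dot(q, q))
        if q_norm > 1e-12:
            q = [qi / q_norm for qi in q]
            basis.append(q)
            # Project the residual onto the orthogonal complement of the
            # selected span (equivalent to refitting least squares).
            coef = dot(q, residual)
            residual = [ri - coef * qi for ri, qi in zip(residual, q)]
    return selected, residual


# Toy data: feature 1 is a near-duplicate of feature 0.
features = [[1.0, 0.0, 0.0], [1.0, 0.1, 0.0], [0.0, 0.0, 1.0]]
target = [3.0, 0.0, 1.0]
selected, residual = omp_select(features, target, k=2)
print(selected)  # [0, 2]
```

In this toy example, a one-shot ranking by correlation with the target would pick features 0 and 1, since both correlate strongly with `target`. Once feature 0 is selected, however, feature 1 has zero correlation with the residual, so the greedy procedure picks feature 2 instead.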