We study the fundamental problem of selecting optimal features for model construction. This problem is computationally challenging on large datasets, even with greedy algorithm variants. To address this challenge, we extend the adaptive query model, recently proposed for greedy forward selection of submodular functions, to the faster paradigm of Orthogonal Matching Pursuit for non-submodular functions. The proposed algorithm achieves exponentially fast parallel run time in the adaptive query model, scaling much better than prior work. Furthermore, our extension allows the use of downward-closed constraints, which can encode certain fairness criteria into the feature selection process. We prove strong approximation guarantees for the algorithm under standard assumptions. These guarantees apply to many parametric models, including Generalized Linear Models. Finally, we demonstrate empirically that the proposed algorithm competes favorably with state-of-the-art feature selection techniques on real-world and synthetic datasets.
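For orientation, the following is a minimal sketch of the classical sequential Orthogonal Matching Pursuit that the abstract takes as its starting point, specialized to least-squares regression. It is not the parallel, constrained algorithm proposed in the paper: the function name `omp_select`, its parameters, and the Gram-Schmidt residual update are illustrative assumptions chosen to keep the example dependency-free.

```python
import math


def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def omp_select(columns, y, k, tol=1e-12):
    """Pick up to k feature indices by sequential Orthogonal Matching Pursuit.

    columns: list of feature columns, each a list with one float per sample
    y:       target vector (list of floats)
    k:       maximum number of features to select
    Returns (selected_indices, final_residual).
    This is an illustrative sketch, not the paper's parallel algorithm.
    """
    residual = list(y)
    basis = []      # orthonormal basis for the span of the selected columns
    selected = []
    for _ in range(k):
        # Score each unselected feature by |<x_j, r>| / ||x_j||.
        best_j, best_score = None, tol
        for j, col in enumerate(columns):
            if j in selected:
                continue
            norm = math.sqrt(dot(col, col))
            if norm == 0.0:
                continue
            score = abs(dot(col, residual)) / norm
            if score > best_score:
                best_j, best_score = j, score
        if best_j is None:
            break  # residual is numerically orthogonal to all remaining features

        # Gram-Schmidt: orthogonalize the chosen column against the basis.
        q = list(columns[best_j])
        for b in basis:
            c = dot(b, q)
            q = [qi - c * bi for qi, bi in zip(q, b)]
        q_norm = math.sqrt(dot(q, q))
        if q_norm <= tol:
            break  # chosen column adds no new direction
        q = [qi / q_norm for qi in q]
        basis.append(q)
        selected.append(best_j)

        # Remove the new direction from the residual; this equals the
        # least-squares residual of y on the selected columns.
        c = dot(q, residual)
        residual = [ri - c * qi for ri, qi in zip(residual, q)]
    return selected, residual
```

Each iteration here makes one pass over all remaining features and selects a single one, so the number of sequential rounds grows with `k`. That sequential dependence is the cost the adaptive query model is meant to reduce.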