FAStEN: An Efficient Adaptive Method for Feature Selection and Estimation in High-Dimensional Functional Regressions

Functional regression analysis is an established tool for many contemporary scientific applications. Regression problems involving large and complex data sets are ubiquitous, and feature selection is crucial for avoiding overfitting and achieving accurate predictions. We propose a new, flexible, and ultra-efficient approach to perform feature selection in a sparse, high-dimensional function-on-function regression problem, and we show how to extend it to the scalar-on-function framework. Our method, called FAStEN, combines functional data, optimization, and machine learning techniques to perform feature selection and parameter estimation simultaneously. We exploit the properties of Functional Principal Components and the sparsity inherent to the Dual Augmented Lagrangian problem to significantly reduce computational cost, and we introduce an adaptive scheme to improve selection accuracy. In addition, we derive asymptotic oracle properties, which guarantee estimation and selection consistency for the proposed FAStEN estimator. Through an extensive simulation study, we benchmark our approach against the best existing competitors and demonstrate a massive gain in terms of CPU time and selection performance, without sacrificing the quality of the coefficient estimates. The theoretical derivations and the simulation study provide a strong motivation for our approach. Finally, we present an application to brain fMRI data from the AOMIC PIOP1 study. Complete FAStEN code is provided at https://github.com/IBM/funGCN.
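The abstract names two ingredients that a small example can make concrete: representing curves by their Functional Principal Component scores, and setting whole blocks of coefficients to zero to select features. The sketch below is illustrative only and is not the FAStEN implementation (see the linked repository for that). It assumes curves observed on a common dense grid, so that the principal components can be obtained from an SVD of the centered data matrix, and it uses the standard group soft-thresholding operator as a generic example of block-sparse shrinkage. The function names `fpc_scores` and `group_soft_threshold` are hypothetical.

```python
import numpy as np


def fpc_scores(curves, n_components):
    """Project discretized curves (n_samples x n_grid) onto their leading
    principal components. Returns (scores, components)."""
    centered = curves - curves.mean(axis=0)
    # SVD of the centered data matrix: rows of vt are the discretized
    # principal components (orthonormal vectors on the grid).
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    components = vt[:n_components]
    scores = centered @ components.T
    return scores, components


def group_soft_threshold(block, threshold):
    """Shrink a coefficient block by its Euclidean (Frobenius) norm.
    The whole block becomes exactly zero when its norm is at or below
    the threshold, which is what removes a feature from the model."""
    norm = np.linalg.norm(block)
    if norm <= threshold:
        return np.zeros_like(block)
    return (1.0 - threshold / norm) * block


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    curves = rng.normal(size=(20, 50))  # 20 synthetic curves on a 50-point grid
    scores, components = fpc_scores(curves, n_components=3)
    print(scores.shape, components.shape)

    weak_block = np.array([[0.1, -0.2]])
    print(group_soft_threshold(weak_block, threshold=1.0))  # entire block zeroed
```

Working with a few scores per curve instead of the full grid is what keeps the coefficient blocks small, and zeroing a block corresponds to dropping the associated functional feature.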


