Beyond Discrete Selection: Continuous Embedding Space Optimization for Generative Feature Selection

Feature selection, whether by filter, wrapper, or embedded approaches, aims to find the optimal feature subset for a given downstream task. However, current feature selection methods have two limitations: 1) their selection criteria vary across domains, which makes them hard to generalize; 2) their selection performance drops significantly on high-dimensional feature spaces coupled with small sample sizes. In light of these challenges, we ask: can selected feature subsets be more robust, more accurate, and agnostic to input dimensionality? In this paper, we reformulate feature selection as a deep differentiable optimization task and propose a new research perspective: treating discrete feature subset selection as optimization in a continuous embedding space. We introduce a novel and principled framework that comprises a sequential encoder, an accuracy evaluator, a sequential decoder, and a gradient ascent optimizer. The framework has four steps: preparation of feature-accuracy training data, deep feature subset embedding, gradient-optimized search, and feature subset reconstruction. Specifically, we use reinforcement feature selection learning to generate diverse, high-quality training data and enhance generalization. By optimizing reconstruction and accuracy losses, we embed feature selection knowledge into a continuous space with an encoder-evaluator-decoder model structure. We then apply a gradient ascent search algorithm to find better embeddings in the learned embedding space. Finally, we reconstruct feature selection solutions from these embeddings and select the feature subset with the highest downstream performance as the optimal subset.
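The gradient-optimized search and reconstruction steps can be illustrated with a toy sketch. This is not the paper's model: the real framework learns a sequential encoder, evaluator, and decoder from feature-accuracy data, whereas the sketch below assumes a fixed logistic evaluator with made-up weights `w`, one embedding coordinate per feature, and a hypothetical threshold decoder. It only shows the idea of moving an embedding uphill on predicted accuracy and then mapping it back to a discrete subset.

```python
import math

def evaluator(z, w, b):
    """Toy differentiable accuracy predictor: sigmoid(w . z + b).
    Stands in for the learned accuracy evaluator (assumption)."""
    s = sum(wi * zi for wi, zi in zip(w, z)) + b
    return 1.0 / (1.0 + math.exp(-s))

def evaluator_grad(z, w, b):
    """Gradient of the predicted accuracy with respect to the embedding z."""
    p = evaluator(z, w, b)
    return [p * (1.0 - p) * wi for wi in w]

def gradient_ascent(z0, w, b, step=0.5, n_steps=20):
    """Move the embedding in the direction that increases predicted accuracy."""
    z = list(z0)
    for _ in range(n_steps):
        g = evaluator_grad(z, w, b)
        z = [zi + step * gi for zi, gi in zip(z, g)]
    return z

def decode(z, threshold=0.0):
    """Hypothetical decoder: keep feature i when z[i] exceeds the threshold.
    The paper uses a learned sequential decoder instead."""
    return [i for i, zi in enumerate(z) if zi > threshold]

# Made-up evaluator weights for four features (illustrative only).
w = [1.5, -2.0, 0.8, -0.5]
b = 0.0

# Embedding of an initial subset {0, 1} under the toy threshold decoder.
z0 = [0.1, 0.1, -0.1, -0.1]
z_star = gradient_ascent(z0, w, b)

print("initial subset:", decode(z0), "predicted accuracy:", evaluator(z0, w, b))
print("refined subset:", decode(z_star), "predicted accuracy:", evaluator(z_star, w, b))
```

In the framework described above, several seed embeddings would be refined this way and the reconstructed subsets compared on the actual downstream task; the sketch refines a single seed and relies on the toy evaluator alone.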

Feature Selection

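Feature selection, also called feature subset selection (FSS) or attribute selection, means selecting N features from M existing features so as to optimize a specific system metric. It is the process of picking the most effective features from the original ones to reduce the dimensionality of the dataset. It is an important means of improving the performance of learning algorithms and a key data preprocessing step in pattern recognition. For a learning algorithm, good training samples are key to training a model.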
