Knockoff-Guided Feature Selection via A Single Pre-trained Reinforced Agent

Feature selection prepares data for AI by eliminating redundant features. Prior research falls into two primary categories: i) supervised feature selection (SFS), which identifies the optimal feature subset based on each feature's relevance to the target variable; ii) unsupervised feature selection (UFS), which reduces the dimensionality of the feature space by capturing the essential information within the feature set rather than using the target variable. However, SFS approaches are time-consuming and generalize poorly because they depend on the target variable and the downstream ML task. UFS methods are limited because the derived feature space is latent and untraceable. To address these challenges, we introduce a feature selection framework that is guided by knockoff features and optimized through reinforcement learning to identify an effective feature subset. In detail, our method generates "knockoff" features that replicate the distribution and characteristics of the original features but are independent of the target variable. Each feature is then assigned a pseudo label based on its correlation with all the knockoff features, which serves as a novel metric for feature evaluation. Our approach uses these pseudo labels to guide the feature selection process in three novel ways, optimized by a single reinforced agent: 1) A deep Q-network, pre-trained with the original features and their corresponding pseudo labels, improves the efficacy of exploration during feature selection. 2) Unsupervised rewards evaluate the quality of a feature subset based on the pseudo labels and the feature space reconstruction loss, reducing dependence on the target variable. 3) A new ε-greedy strategy incorporates insights from the pseudo labels to make the feature selection process more effective.
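The knockoff and pseudo-label steps can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not say which knockoff generator is used, so the sketch substitutes the standard second-order Gaussian (equicorrelated) construction; the pseudo-label rule (mean absolute correlation with the knockoffs, thresholded at the median) and the way the ε-greedy step uses the label are hypothetical choices, and all function names are invented for the example.

```python
import numpy as np


def gaussian_knockoffs(X, rng):
    """Second-order Gaussian knockoffs (equicorrelated construction).

    ASSUMPTION: a stand-in generator; the abstract does not specify one.
    Requires a non-singular feature correlation matrix.
    Returns the standardized features and their knockoff copies.
    """
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)
    p = Xs.shape[1]
    sigma = np.corrcoef(Xs, rowvar=False)
    lam_min = np.linalg.eigvalsh(sigma)[0]       # smallest eigenvalue
    s = 0.99 * min(1.0, 2.0 * lam_min)           # shrunk to keep cov positive definite
    sigma_inv = np.linalg.inv(sigma)
    mean = Xs @ (np.eye(p) - s * sigma_inv)      # conditional mean of knockoffs given X
    cov = 2.0 * s * np.eye(p) - (s ** 2) * sigma_inv
    cov = (cov + cov.T) / 2.0                    # symmetrize against round-off
    chol = np.linalg.cholesky(cov + 1e-10 * np.eye(p))
    noise = rng.standard_normal(Xs.shape) @ chol.T
    return Xs, mean + noise


def pseudo_labels(Xs, Xk):
    """Score each feature by its correlation with all knockoff features.

    ASSUMPTION: the score is the mean absolute correlation and the label is
    1 when the score is at or above the median; the abstract gives no rule,
    and the direction of the threshold is arbitrary here.
    """
    p = Xs.shape[1]
    cross = np.corrcoef(Xs, Xk, rowvar=False)[:p, p:]  # feature-vs-knockoff block
    score = np.abs(cross).mean(axis=1)
    labels = (score >= np.median(score)).astype(int)
    return score, labels


def guided_epsilon_greedy(q_values, pseudo_label, epsilon, rng):
    """Pick select (1) or drop (0) for one feature.

    ASSUMPTION: when exploring, follow the pseudo label instead of drawing
    a uniformly random action; otherwise act greedily on the Q-values.
    """
    if rng.random() < epsilon:
        return int(pseudo_label)
    return int(np.argmax(q_values))
```

With these pieces, a toy run would standardize a data matrix, draw knockoffs, compute one pseudo label per feature, and pass that label to `guided_epsilon_greedy` whenever the agent decides on the corresponding feature. Pre-training the deep Q-network and the reconstruction-based reward are left out because the abstract does not give enough detail to sketch them faithfully.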


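Feature selection

Feature selection, also called feature subset selection (FSS) or attribute selection, means choosing N features from M existing features so that a specified metric of the system is optimized. It is the process of selecting the most effective of the original features to reduce the dimensionality of the dataset, an important means of improving the performance of learning algorithms, and a key data preprocessing step in pattern recognition. For a learning algorithm, good training samples are key to training the model.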
