Feature selection improves the AI-readiness of data by eliminating redundant features. Prior research falls into two primary categories: i) supervised feature selection (SFS), which identifies the optimal feature subset based on each feature's relevance to the target variable; and ii) unsupervised feature selection (UFS), which reduces the dimensionality of the feature space by capturing the essential information within the feature set rather than using the target variable. However, SFS approaches are time-consuming and generalize poorly because they depend on the target variable and the downstream ML task, while UFS methods are limited because the reduced feature space they produce is latent and untraceable. To address these challenges, we introduce a feature selection framework, guided by knockoff features and optimized through reinforcement learning, that identifies an optimal and effective feature subset. In detail, our method generates "knockoff" features that replicate the distribution and characteristics of the original features but are independent of the target variable. Each feature is then assigned a pseudo label based on its correlation with all the knockoff features, and this pseudo label serves as a novel metric for feature evaluation. Our approach uses these pseudo labels to guide the feature selection process in three novel ways, all optimized by a single reinforced agent: 1) a deep Q-network, pre-trained with the original features and their corresponding pseudo labels, improves the efficacy of exploration during feature selection; 2) unsupervised rewards evaluate the quality of a feature subset based on the pseudo labels and the feature space reconstruction loss, reducing dependence on the target variable; 3) a new ε-greedy strategy incorporates insights from the pseudo labels to make the feature selection process more effective.
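The knockoff and pseudo-label steps can be illustrated with a minimal sketch. The abstract does not specify how the knockoffs are generated or how correlations are turned into pseudo labels, so everything below is an assumption made for illustration: the knockoffs use the standard equicorrelated Gaussian construction (which presumes roughly Gaussian features), and the hypothetical `pseudo_labels` helper scores each feature by its mean absolute Pearson correlation with all knockoff columns and thresholds that score at the median. The function names, the `shrink` parameter, and the thresholding rule are not taken from the paper.

```python
import numpy as np


def gaussian_knockoffs(X, rng, shrink=0.9):
    """Sample equicorrelated Gaussian knockoffs (an assumed construction).

    X is standardized internally; constant columns are not supported.
    Returns the standardized features and their knockoff copies.
    """
    X_std = (X - X.mean(axis=0)) / X.std(axis=0)
    n, p = X_std.shape
    sigma = np.corrcoef(X_std, rowvar=False)

    # The equicorrelated choice s = min(1, 2 * lambda_min), shrunk slightly
    # so that the conditional covariance stays strictly positive definite.
    lam_min = np.linalg.eigvalsh(sigma)[0]  # eigenvalues are ascending
    if lam_min <= 1e-8:
        raise ValueError("correlation matrix is (near-)singular")
    D = shrink * min(1.0, 2.0 * lam_min) * np.eye(p)

    # Knockoffs given X are Gaussian with the mean and covariance below.
    sigma_inv_D = np.linalg.solve(sigma, D)
    cond_mean = X_std - X_std @ sigma_inv_D
    cond_cov = 2.0 * D - D @ sigma_inv_D
    cond_cov = (cond_cov + cond_cov.T) / 2.0 + 1e-10 * np.eye(p)

    L = np.linalg.cholesky(cond_cov)
    X_knock = cond_mean + rng.standard_normal((n, p)) @ L.T
    return X_std, X_knock


def pseudo_labels(X_std, X_knock):
    """Score each feature against all knockoffs (assumed aggregation rule)."""
    p = X_std.shape[1]
    # Cross-correlation block: rows are original features, columns knockoffs.
    cross = np.corrcoef(X_std, X_knock, rowvar=False)[:p, p:]
    scores = np.abs(cross).mean(axis=1)
    # Median threshold; which side counts as "informative" is an assumption.
    labels = (scores > np.median(scores)).astype(int)
    return scores, labels


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    latent = rng.standard_normal((500, 1))
    X = 0.6 * latent + rng.standard_normal((500, 5))  # correlated toy features
    X_std, X_knock = gaussian_knockoffs(X, rng)
    scores, labels = pseudo_labels(X_std, X_knock)
    print(scores.round(3), labels)
```

No target variable appears anywhere in the sketch, which is the property that lets the pseudo labels stand in for supervision. In the framework described above, such labels would then feed the Q-network pre-training, the unsupervised reward, and the ε-greedy exploration rule; those components are not sketched here.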