Feature selection is a crucial task in settings where data is high-dimensional or acquiring the full set of features is costly. Recent developments in neural network-based embedded feature selection show promising results across a wide range of applications. Concrete Autoencoders (CAEs), considered state-of-the-art in embedded feature selection, may struggle to achieve stable joint optimization, hurting their training time and generalization. In this work, we identify that this instability is correlated with the CAE learning duplicate selections. To remedy this, we propose a simple and effective improvement: Indirectly Parameterized CAEs (IP-CAEs). IP-CAEs learn an embedding and a mapping from it to the Gumbel-Softmax distributions' parameters. Despite being simple to implement, IP-CAE exhibits significant and consistent improvements over CAE in both generalization and training time across several datasets for reconstruction and classification. Unlike CAE, IP-CAE effectively leverages non-linear relationships and does not require retraining the jointly optimized decoder. Furthermore, our approach is, in principle, generalizable to Gumbel-Softmax distributions beyond feature selection.
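As a rough illustration of the indirect parameterization described in the abstract, the sketch below computes the Gumbel-Softmax logits from a learned embedding through a shared linear map, instead of storing the logits as free parameters. The linear form of the mapping, the shapes, and all names (`indirect_logits`, `gumbel_softmax`, `select_features`, `embedding`, `weight`, `bias`) are assumptions made for illustration; the abstract does not specify them, and no training loop or decoder is shown.

```python
import math
import random


def indirect_logits(embedding, weight, bias):
    """Map a k x p embedding to k x d logits with a shared linear map.

    Assumption: the mapping is linear (weight is p x d, bias has length d).
    A directly parameterized selector would store the k x d logits themselves.
    """
    columns = list(zip(*weight))  # d columns, each of length p
    return [
        [sum(e * w for e, w in zip(row, col)) + b for col, b in zip(columns, bias)]
        for row in embedding
    ]


def gumbel_softmax(logits, temperature, rng):
    """Draw one relaxed one-hot sample from a Gumbel-Softmax distribution."""
    noisy = []
    for logit in logits:
        # Gumbel(0, 1) noise by inverse transform; clamp u away from 0.
        u = max(rng.random(), 1e-12)
        gumbel = -math.log(-math.log(u))
        noisy.append((logit + gumbel) / temperature)
    # Numerically stable softmax over the perturbed, temperature-scaled logits.
    top = max(noisy)
    exps = [math.exp(v - top) for v in noisy]
    total = sum(exps)
    return [v / total for v in exps]


def select_features(x, embedding, weight, bias, temperature, rng):
    """Return k relaxed selections, each a convex combination of the d inputs."""
    logits = indirect_logits(embedding, weight, bias)
    samples = [gumbel_softmax(row, temperature, rng) for row in logits]
    return [sum(s * xi for s, xi in zip(sample, x)) for sample in samples]


if __name__ == "__main__":
    rng = random.Random(0)
    k, p, d = 2, 3, 4  # selections, embedding size, input features (hypothetical)
    embedding = [[rng.gauss(0, 1) for _ in range(p)] for _ in range(k)]
    weight = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(p)]
    bias = [0.0] * d
    x = [1.0, 2.0, 3.0, 4.0]
    print(select_features(x, embedding, weight, bias, temperature=0.5, rng=rng))
```

In this sketch the trainable quantities would be `embedding`, `weight`, and `bias`. As the temperature is lowered, each sample concentrates on a single input feature, so each output approaches the value of one selected feature.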