Accurately measuring the relevance and redundancy of features is a long-standing challenge in feature selection. Existing filter-based feature selection methods cannot directly measure redundancy for continuous data. In addition, most methods require the number of selected features to be specified manually, which may introduce errors in the absence of expert knowledge. In this paper, we propose a non-parametric feature selection algorithm based on maximum inter-class variation and minimum redundancy, abbreviated as MVMR-FS. We first apply supervised and unsupervised kernel density estimation to the features to capture their similarities and differences in inter-class and overall distributions. We then present the maximum inter-class variation and minimum redundancy (MVMR) criterion, in which the inter-class probability distributions reflect feature relevance and the distances between overall probability distributions quantify redundancy. Finally, we employ an adaptive genetic algorithm (AGA) to search for the feature subset that minimizes the MVMR criterion. Compared with ten state-of-the-art methods, MVMR-FS achieves the highest average accuracy, improving accuracy by 5% to 11%.
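The two density-based quantities described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the exact formulas, so the function names (`relevance`, `redundancy`), the use of total variation distance, the min-max scaling, and the grid size are all assumptions chosen for illustration. The sketch uses `scipy.stats.gaussian_kde` for both the class-conditional (supervised) and overall (unsupervised) density estimates, and it leaves out the combined MVMR criterion and the AGA search.

```python
import numpy as np
from scipy.stats import gaussian_kde


def kde_on_grid(values, grid):
    """Gaussian KDE of `values`, evaluated on `grid` and normalized to sum to 1."""
    density = gaussian_kde(values)(grid)
    return density / density.sum()


def total_variation(p, q):
    """Total variation distance between two discrete distributions (0 to 1)."""
    return 0.5 * float(np.abs(p - q).sum())


def relevance(x, y, n_grid=200):
    """Assumed relevance proxy: mean pairwise distance between the
    class-conditional (supervised) KDEs of one feature.
    Expects at least two classes and a non-constant feature."""
    grid = np.linspace(x.min(), x.max(), n_grid)
    densities = [kde_on_grid(x[y == c], grid) for c in np.unique(y)]
    distances = [
        total_variation(p, q)
        for i, p in enumerate(densities)
        for q in densities[i + 1:]
    ]
    return float(np.mean(distances))


def redundancy(x_a, x_b, n_grid=200):
    """Assumed redundancy proxy: 1 minus the distance between the overall
    (unsupervised) KDEs of two features, each min-max scaled to [0, 1]."""
    grid = np.linspace(0.0, 1.0, n_grid)
    scaled = [(x - x.min()) / (x.max() - x.min()) for x in (x_a, x_b)]
    p, q = (kde_on_grid(s, grid) for s in scaled)
    return 1.0 - total_variation(p, q)


# Synthetic data (hypothetical): one class-dependent feature, one pure-noise feature.
rng = np.random.default_rng(0)
y = np.repeat([0, 1], 200)
informative = rng.normal(loc=3.0 * y, scale=1.0)  # class means differ
noise = rng.normal(size=y.size)                   # same distribution in both classes

print("relevance :", relevance(informative, y), relevance(noise, y))
print("redundancy:", redundancy(informative, informative), redundancy(informative, noise))
```

In this sketch a feature whose class-conditional densities barely overlap receives a high relevance score, and two features whose scaled overall densities coincide receive a redundancy close to 1. How MVMR-FS actually combines the two terms, and how the AGA explores feature subsets, would need to be taken from the full paper.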