Scalable mRMR feature selection to handle high dimensional datasets: Vertical partitioning based Iterative MapReduce framework

While building machine learning models, Feature selection (FS) stands out as an essential preprocessing step used to handle the uncertainty and vagueness in the data. Recently, the minimum Redundancy and Maximum Relevance (mRMR) approach has proven to be effective in obtaining the irredundant feature subset. Owing to the generation of voluminous datasets, it is essential to design scalable solutions using distributed/parallel paradigms. MapReduce solutions are proven to be one of the best approaches to designing fault-tolerant and scalable solutions. This work analyses the existing MapReduce approaches for mRMR feature selection and identifies the limitations thereof. In the current study, we proposed VMR_mRMR, an efficient vertical partitioning-based approach using a memorization approach, thereby overcoming the extant approaches limitations. The experiment analysis says that VMR_mRMR significantly outperformed extant approaches and achieved a better computational gain (C.G). In addition, we also conducted a comparative analysis with the horizontal partitioning approach HMR_mRMR [1] to assess the strengths and limitations of the proposed approach.

翻译：在构建机器学习模型时，特征选择作为重要的预处理步骤，用于处理数据中的不确定性和模糊性。近年来，最小冗余最大相关性方法已被证明在获取非冗余特征子集方面具有显著效果。由于海量数据集的生成，利用分布式/并行范式设计可扩展的解决方案至关重要。MapReduce解决方案已被证明是设计容错且可扩展方案的最佳途径之一。本文分析了现有基于MapReduce的mRMR特征选择方法，并指出其局限性。本研究提出VMR_mRMR——一种基于垂直分区并采用记忆化策略的高效方法，从而克服现有方法的不足。实验分析表明，VMR_mRMR显著优于现有方法，并获得了更优的计算增益。此外，我们还与水平分区方法HMR_mRMR[1]进行了对比分析，以评估所提方法的优势与局限。

相关内容

特征选择

关注 5940

特征选择( Feature Selection )也称特征子集选择( Feature Subset Selection , FSS )，或属性选择( Attribute Selection )。是指从已有的M个特征(Feature)中选择N个特征使得系统的特定指标最优化，是从原始特征中选择出一些最有效特征以降低数据集维度的过程,是提高学习算法性能的一个重要手段,也是模式识别中关键的数据预处理步骤。对于一个学习算法来说,好的学习样本是训练模型的关键。

【CVPR 2022】一个完全无监督的框架，从噪声和部分测量中学习图像，Robust Equivariant Imaging: a fully unsupervised framework for learning to image

专知会员服务

25+阅读 · 2022年3月3日

【NeurIPS2021】用于文本图表示学习的 GNN 嵌套 Transformer 模型：GraphFormers

专知会员服务

46+阅读 · 2021年11月24日

Linux导论，Introduction to Linux，96页ppt

专知会员服务

82+阅读 · 2020年7月26日

【ACL2020】多模态信息抽取，365页ppt

专知会员服务

151+阅读 · 2020年7月6日