In building machine learning models, feature selection (FS) is an essential preprocessing step for handling uncertainty and vagueness in the data. Recently, the minimum Redundancy Maximum Relevance (mRMR) approach has proven effective in obtaining a non-redundant feature subset. As increasingly voluminous datasets are generated, it is essential to design scalable solutions using distributed/parallel paradigms. MapReduce has proven to be one of the best approaches to designing fault-tolerant and scalable solutions. This work analyses the existing MapReduce approaches to mRMR feature selection and identifies their limitations. We propose VMR_mRMR, an efficient vertical partitioning-based approach that uses memoization, thereby overcoming the limitations of the existing approaches. Experimental analysis shows that VMR_mRMR significantly outperforms the existing approaches and achieves a better computational gain (C.G). In addition, we conduct a comparative analysis with the horizontal partitioning approach HMR_mRMR [1] to assess the strengths and limitations of the proposed approach.