A data-science pipeline to enable the Interpretability of Many-Objective Feature Selection

Many-Objective Feature Selection (MOFS) approaches use four or more objectives to determine the relevance of a subset of features in a supervised learning task. As a consequence, MOFS typically returns a large set of non-dominated solutions, which the data scientist has to assess before making the final choice. Given the multivariate nature of the assessment, which may include criteria (e.g. fairness) not related to predictive accuracy, this step is often not straightforward and suffers from a lack of supporting tools. For instance, the solutions are commonly presented in a table, which provides little information about the trade-offs and the relations between criteria over the set of solutions. This paper proposes an original methodology to support data scientists in the interpretation and comparison of the MOFS outcome by combining post-processing and visualisation of the set of solutions. The methodology supports the data scientist in the selection of an optimal feature subset by providing her with high-level information at three different levels: objectives, solutions, and individual features. The methodology is experimentally assessed on two feature selection tasks, adopting a GA-based MOFS with six objectives (number of selected features, balanced accuracy, F1-Score, variance inflation factor, statistical parity, and equalised odds). The results show the added value of the methodology in the selection of the final subset of features.
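To make the notion of a non-dominated solution concrete, the following is a minimal sketch and not the paper's implementation. It assumes every objective is expressed so that lower is better, and the candidate values, the three objectives shown, and the function names `dominates` and `non_dominated` are hypothetical, chosen only for illustration.

```python
from typing import List, Sequence


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """Return True if `a` is no worse than `b` on every objective and
    strictly better on at least one (all objectives are minimised)."""
    no_worse = all(x <= y for x, y in zip(a, b))
    strictly_better = any(x < y for x, y in zip(a, b))
    return no_worse and strictly_better


def non_dominated(points: List[Sequence[float]]) -> List[int]:
    """Return the indices of the points that no other point dominates."""
    return [
        i
        for i, p in enumerate(points)
        if not any(dominates(q, p) for j, q in enumerate(points) if j != i)
    ]


# Hypothetical feature subsets scored on three minimised objectives:
# (number of features, 1 - balanced accuracy, statistical parity difference)
candidates = [
    (5, 0.20, 0.10),
    (8, 0.15, 0.12),
    (8, 0.22, 0.15),   # worse than the first candidate on every objective
    (12, 0.15, 0.05),
]

front = non_dominated(candidates)
print(front)  # indices of the candidates that survive the filter
```

Even in this toy case three of the four candidates remain, and none is better than the others on all criteria. With six objectives the surviving set usually grows much larger, which is why a plain table of solutions is hard to interpret.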


