Selecting a minimal feature set that is maximally informative about a target variable is a central task in machine learning and statistics. Information theory provides a powerful framework for formulating feature selection algorithms, yet a rigorous, information-theoretic definition of feature relevancy that accounts for feature interactions such as redundant and synergistic contributions is still missing. We argue that this gap is inherent to classical information theory, which does not provide measures to decompose the information a set of variables provides about a target into unique, redundant, and synergistic contributions. Such a decomposition has been introduced only recently by the partial information decomposition (PID) framework. Using PID, we clarify why feature selection is a conceptually difficult problem when approached using information theory, and we provide a novel definition of feature relevancy and redundancy in PID terms. From this definition, we show that the conditional mutual information (CMI) maximizes relevancy while minimizing redundancy, and we propose an iterative, CMI-based algorithm for practical feature selection. We demonstrate the power of our CMI-based algorithm in comparison to the unconditional mutual information on benchmark examples and provide corresponding PID estimates to highlight how PID allows quantifying the information contribution of features and their interactions in feature-selection problems.
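
As a rough illustration of what an iterative, CMI-based selection can look like, the following Python sketch greedily adds the feature with the largest estimated CMI with the target, conditioned on the features already selected. It is not the authors' implementation: it assumes discrete features, a naive plug-in (counting) estimator, and a simple stopping rule, and the names `cmi`, `greedy_cmi_selection`, and `threshold` are hypothetical.

```python
from collections import Counter
from itertools import product
from math import log2


def cmi(x, y, z):
    """Plug-in estimate of I(x; y | z) in bits for equal-length discrete sequences."""
    n = len(x)
    c_xyz = Counter(zip(x, y, z))
    c_xz = Counter(zip(x, z))
    c_yz = Counter(zip(y, z))
    c_z = Counter(z)
    total = 0.0
    for (xi, yi, zi), c in c_xyz.items():
        # p(x,y,z) * log2( p(z) p(x,y,z) / (p(x,z) p(y,z)) ), written with raw counts
        total += (c / n) * log2(c * c_z[zi] / (c_xz[(xi, zi)] * c_yz[(yi, zi)]))
    return total


def greedy_cmi_selection(features, target, threshold=1e-9):
    """Greedy forward selection (an assumed scheme, see lead-in).

    features: dict mapping a feature name to a discrete sequence.
    threshold: hypothetical stopping criterion on the best CMI score.
    """
    n = len(target)
    selected = []
    remaining = list(features)
    while remaining:
        # The joint state of all selected features acts as the conditioning variable.
        cond = [tuple(features[name][i] for name in selected) for i in range(n)]
        scores = {name: cmi(features[name], target, cond) for name in remaining}
        best = max(remaining, key=scores.get)
        if scores[best] <= threshold:
            break  # no remaining feature adds information beyond the selected set
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy data: the target encodes the pair (a, b); "a_copy" is redundant with "a",
# and "noise" is independent of the target.
rows = list(product([0, 1], repeat=3))
a = [r[0] for r in rows]
b = [r[1] for r in rows]
noise = [r[2] for r in rows]
features = {"a": a, "b": b, "a_copy": list(a), "noise": noise}
target = [2 * ai + bi for ai, bi in zip(a, b)]

print(greedy_cmi_selection(features, target))  # ['a', 'b']
```

In this toy case the redundant copy scores zero once `a` has been selected, because conditioning removes the information the two features share. A purely synergistic pair (for example, a target equal to `a XOR b`) has zero CMI at the first step, so this simple greedy sketch would stop before selecting either feature; the abstract does not say how the proposed algorithm treats such cases.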