Variable selection is the process of identifying the truly important predictors among the inputs. Complex nonlinear dependencies and strong coupling among predictors make variable selection in high-dimensional data challenging. In addition, real-world applications increasingly demand that the selection process be interpretable. A pragmatic approach should not only identify the most predictive covariates but also provide ample, easy-to-understand grounds for the covariates it removes. In view of these requirements, this paper proposes an approach to transparent, nonlinear variable selection. To transparently decouple the information within the input predictors, a three-step heuristic search is designed that partitions them into four subsets: relevant predictors, which are selected, and uninformative, redundant, and conditionally independent predictors, which are removed. A nonlinear partial correlation coefficient is introduced to better identify predictors that have a nonlinear functional dependence on the response. The proposed method is model-free, and the selected subset can serve as input to commonly used predictive models. Experiments demonstrate that the proposed method outperforms state-of-the-art baselines in terms of prediction accuracy and model interpretability.
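The four-way grouping described above can be illustrated with a minimal sketch. It is not the paper's algorithm: the abstract does not define the nonlinear partial correlation coefficient or the exact order of the three steps, so the sketch substitutes a binned correlation ratio computed on residuals as a stand-in dependence measure. The names `bin_means_fit`, `dependence`, and `group_predictors`, the threshold `tau`, the bin count, and the synthetic data are all hypothetical choices made for illustration.

```python
import math
import random
from statistics import fmean


def bin_means_fit(x, y, bins=8):
    """Approximate E[y | x] by the mean of y inside equal-frequency bins of x."""
    order = sorted(range(len(x)), key=lambda i: x[i])
    size = math.ceil(len(x) / bins)
    fitted = [0.0] * len(x)
    for start in range(0, len(x), size):
        idx = order[start:start + size]
        mean = fmean(y[i] for i in idx)
        for i in idx:
            fitted[i] = mean
    return fitted


def dependence(x, y, bins=8):
    """Correlation ratio in [0, 1]: share of var(y) explained by binned x.

    Stand-in (assumption) for a nonlinear dependence measure; it is not
    the coefficient introduced in the paper.
    """
    mean_y = fmean(y)
    total = sum((v - mean_y) ** 2 for v in y)
    if total == 0.0:
        return 0.0
    fitted = bin_means_fit(x, y, bins)
    unexplained = sum((v - f) ** 2 for v, f in zip(y, fitted))
    return 1.0 - unexplained / total


def group_predictors(X, y, tau=0.1):
    """Sort predictors (dict: name -> list of values) into four groups."""
    groups = {"relevant": [], "uninformative": [],
              "redundant": [], "conditionally_independent": []}
    # Step 1 (assumed): marginal screening against the response.
    scores = {name: dependence(col, y) for name, col in X.items()}
    candidates = []
    for name in sorted(X, key=scores.get, reverse=True):
        if scores[name] < tau:
            groups["uninformative"].append(name)
        else:
            candidates.append(name)
    residual = list(y)  # part of y not yet explained by selected predictors
    for name in candidates:
        col = X[name]
        # Step 2 (assumed): nearly a function of an already selected predictor.
        if any(dependence(X[s], col) > 1.0 - tau for s in groups["relevant"]):
            groups["redundant"].append(name)
        # Step 3 (assumed): no dependence left once selected predictors are
        # accounted for, a crude residual-based "partial" dependence check.
        elif dependence(col, residual) < tau:
            groups["conditionally_independent"].append(name)
        else:
            groups["relevant"].append(name)
            fitted = bin_means_fit(col, residual)
            residual = [r - f for r, f in zip(residual, fitted)]
    return groups


# Synthetic data (hypothetical): y depends on x1 quadratically and on x2 linearly.
random.seed(0)
n = 500
x1 = [random.uniform(-1, 1) for _ in range(n)]
x2 = [random.uniform(-1, 1) for _ in range(n)]
X = {
    "x1": x1,
    "x2": x2,
    "x1_copy": [a + random.gauss(0, 0.05) for a in x1],   # near-duplicate of x1
    "x_ci": [a + random.uniform(-0.5, 0.5) for a in x1],  # noisy proxy of x1
    "noise": [random.uniform(-1, 1) for _ in range(n)],   # unrelated to y
}
y = [a * a + 0.5 * b + random.gauss(0, 0.05) for a, b in zip(x1, x2)]

groups = group_predictors(X, y)
for label, names in groups.items():
    print(label, names)
```

The synthetic data is built so that `x1` has zero linear correlation with the response but a strong quadratic one, which a linear screen would miss. Each removed predictor carries an explicit reason (its group label), which is the kind of transparency the abstract asks for; the thresholds and the binning are tuning choices of this sketch, not values taken from the paper.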